arXiv: 1505.04141 v2 [cs.CV] 18 May 2015 


Published in the International Journal of Computer Vision (IJCV), April 2015: 
http://dx.doi.org/10.1007/s11263-015-0814-0 


WhittleSearch: Interactive Image Search with Relative Attribute 
Feedback 

Adriana Kovashka • Devi Parikh • Kristen Grauman 




Abstract We propose a novel mode of feedback for image search, where a
user describes which properties of exemplar images should be adjusted in
order to more closely match his/her mental model of the image sought. For
example, perusing image results for a query “black shoes”, the user might
state, “Show me shoe images like these, but sportier.” Offline, our
approach first learns a set of ranking functions, each of which predicts
the relative strength of a nameable attribute in an image (e.g.,
sportiness). At query time, the system presents the user with a set of
exemplar images, and the user relates them to his/her target image with
comparative statements. Using a series of such constraints in the
multi-dimensional attribute space, our method iteratively updates its
relevance function and re-ranks the database of images. To determine which
exemplar images receive feedback from the user, we present two variants of
the approach: one where the feedback is user-initiated and another where
the feedback is actively system-initiated. In either case, our approach
allows a user to efficiently “whittle away” irrelevant portions of the
visual feature space, using semantic language to precisely communicate her
preferences to the system. We demonstrate our technique for refining image
search for people, products, and scenes, and we show that it outperforms
traditional binary relevance feedback in terms of search speed and
accuracy. In addition, the ordinal nature of relative attributes helps
make our active approach efficient, both computationally for the machine
when selecting the reference images, and for the user by requiring less
user interaction than conventional passive and active methods.

Keywords Content-based image search • Interactive image 
search • Active selection • Relative attributes 


1 Introduction 


In image search, the user often has a mental picture of his or her desired
content. For example, a shopper wants to retrieve those catalog pages that
match his envisioned style of clothing; a witness wants to help law
enforcement locate a suspect in a database based on his memory of the
face; a web page designer wants to find a stock photo suitable for her
customer’s brand image. Therefore, a central challenge is how to allow the
user to convey that mental picture to the system. Due to the well-known
“semantic gap”, which separates the system’s low-level image
representation from the user’s high-level concept, retrieval through a
single user interaction, i.e., a one-shot query, is generally
insufficient. Keywords alone are clearly not enough; even if all existing
images were tagged to enable keyword search, it is infeasible to
pre-assign tags sufficient to satisfy any future query a user may dream
up. Indeed, vision algorithms are necessary to further parse the content
of images for many search tasks. Advances in image descriptors, learning
algorithms, and large-scale indexing have all had impact in recent years.

The key to overcoming the gap appears to be interactive search techniques
that allow a user to iteratively refine the results retrieved by the
system (Cox et al 2000; Kurita and Kato 1993; Rui et al 1998; Zhou and
Huang 2003; Ferecatu and Geman 2007; Zavesky and Chang 2008). The basic
idea is to show the user candidate results, obtain feedback, and adapt the
system’s relevance ranking function accordingly. However, existing image
search methods provide only a narrow channel of feedback to the system.
Typically, a user refines the retrieved images via binary feedback on
exemplars deemed relevant or irrelevant (Kurita and Kato 1993; Cox et al
2000; Rui et al 1998; Zhou and Huang 2003; Ferecatu and Geman 2007), or
else attempts to tune system parameters such as weights on a small set of
low-level features (e.g., texture, color, edges) (Flickner et al 1995; Ma
and Manjunath 1997; Iqbal and Aggarwal 2002). The latter is clearly a
burden for a user who likely cannot understand the inner workings of the
algorithm. The former feedback is more natural to supply, yet it leaves
the system to infer what about those images the user found relevant or
irrelevant, and therefore can be slow to converge on the user’s target in
practice. The semantic gap between low-level visual cues and the
high-level intent of a user remains, making it difficult for people to
predict the behavior of content-based search systems.

Fig. 1 Main idea: Allow users to give relative attribute feedback on
reference images to refine their image search.

In light of these shortcomings, we propose a novel mode of feedback where
a user directly describes how high-level properties of exemplar images
should be adjusted in order to more closely match his/her envisioned
target images. For example, when conducting a query on a shopping website,
the user might state: “I want shoes like these, but more formal.” When
browsing images of mug shots of suspects, a witness to a crime could say:
“He looked like this, but with longer hair and a broader nose.” When
searching for stock photos to fit an ad, he might say: “I need a scene
similarly bright as this one and more urban than that one.” See Figure 1.
In this way, rather than simply state which images are (ir)relevant, the
user employs semantic terms to say how they are so. Such feedback enables
the system to more closely match the user’s mental model of the desired
content, with less total interaction effort compared to conventional
click-based relevance feedback. We call the approach WhittleSearch, since
it allows users to “whittle away” irrelevant portions of the visual
feature space via precise, intuitive statements of their attribute
preferences.


Briefly, our relative attribute feedback approach works as follows.
Offline, we first learn a set of ranking functions, each of which predicts
the relative strength of a nameable attribute in an image (e.g., the
degree of shininess, furriness, etc.). At query time, the system presents
some reference exemplar image(s), and the user provides relative attribute
feedback on one or more of those images. Using the resulting constraints
in the multi-dimensional attribute space, we update the system’s relevance
function, re-rank the pool of images, and display to the user the next
exemplar image(s). This procedure iterates using the accumulated
constraints until the top-ranked images are acceptably close to the user’s
target.

In this pipeline, a key question is which exemplar images should be shown
to the user for feedback. To address this question, we explore two
variants of the proposed WhittleSearch approach: one where the user
decides which images require relative attribute feedback, and one where
the system decides for which images it would most like the user’s
feedback.


In standard search interfaces, the user is shown a page of image results,
i.e., those images the system currently estimates to be most relevant, and
is free to react to any of them. Similarly, in the first of the two
WhittleSearch variants, we present the user with reference images
consisting of the top-ranked most relevant images and allow him/her to
generate feedback that pairs any of those images with any attribute in our
vocabulary. This setup gives the user the freedom to comment on exactly
what he/she finds important for achieving good image results. See Figure
2(a). Since the presented reference images are those currently ranked best
by the system, this formulation has the additional advantage that the user
is shown only those results that are increasingly similar to the target
image.


However, the images believed to be most relevant need not be most
informative for reducing the system’s uncertainty. Therefore, in the
second WhittleSearch variant, we develop an active approach for selecting
the reference images for feedback. Intuitively, we want to solicit
feedback on those exemplars that would most improve the system’s notion of
relevance. Existing methods for actively guiding user feedback typically
exploit classifier uncertainty to find useful exemplars, e.g., (Tong and
Chang 2001; Li et al 2001; Cox et al 2000; Zhou and Huang 2003), or use
clustering to distribute feedback among representative exemplars (Ferecatu
and Geman 2007). Such traditional approaches have two main limitations.
First, the imprecision of binary relevance feedback (“Image X is relevant;
image Y is not.”) clouds the active selection criterion because
extrapolation of the feedback to other images is unreliable. Second,
existing active selection techniques add substantial computational
overhead to the interactive search loop, since ideally they must scan all
database images to find the most informative exemplars.

Fig. 2 We consider two ways to elicit feedback for WhittleSearch: (a) a
user-initiated approach, and (b) a system-initiated approach. In (a), the
user browses the current top-ranked images and decides what to comment on.
In (b), the system actively requests feedback on a specific image and
attribute that is expected to most reduce its uncertainty about the
relevance of the database images for this particular user query.

Taking these shortcomings into account, in our Active WhittleSearch
formulation, we propose to guide the user through a coarse-to-fine search
using relative attributes. At each iteration of feedback, the user
provides a visual comparison between the attribute in his envisioned
target and a “pivot” exemplar, where a pivot separates all remaining
relevant images into two balanced sets. We show how to actively determine
along which of multiple such attributes the user’s comparison should next
be requested, based on the expected information gain that would result.
The resulting algorithm is reminiscent of the popular 20-questions game,
except that the questions generated by the system are comparative in
nature. See Figure 2(b).

The active variant of our method works as follows. Given a database of
images, we first construct a binary search tree for each relative
attribute of interest (e.g., pointiness, shininess, etc.). Initially, the
pivot exemplar for each attribute is the database image with the median
relative attribute value. Starting at the roots of these trees, we predict
the information gain that would result from asking the user how his target
image compares to each of the current pivots. To compute the expected
gain, we devise methods to estimate the likelihood of the user’s response
given the feedback history. Then, among the pivots, the most informative
comparison is requested, generating a question to the user such as, “Is
your target image more or less (or equally) pointy than this image?”
Following the user’s response, the system updates its relevance
predictions on all images and moves the current pivot down one level
within the selected attribute’s tree, unless the response is “equally”, in
which case we no longer need to explore this attribute tree.
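
The pivot mechanics in the paragraph above can be illustrated with a small sketch. It is not the authors' implementation: the class name `AttributePivotTree`, the response strings, and the choice of the upper median for even-sized ranges are all assumptions made for illustration, and the information-gain step that chooses among attributes is omitted. The tree is kept implicit as a shrinking index range over the images sorted by one attribute.

```python
class AttributePivotTree:
    """Implicit binary search tree over images sorted by one attribute.

    Hypothetical helper for illustration; names are not from the paper.
    """

    def __init__(self, values):
        # values: dict mapping image id -> predicted attribute strength
        self.order = sorted(values, key=values.get)
        self.lo, self.hi = 0, len(self.order)  # half-open unexplored range
        self.done = False

    def pivot(self):
        """Return the median image of the unexplored range, or None."""
        if self.done or self.lo >= self.hi:
            return None
        return self.order[(self.lo + self.hi) // 2]

    def update(self, response):
        """Descend one level given the user's comparison to the pivot."""
        mid = (self.lo + self.hi) // 2
        if response == "more":       # target is stronger than the pivot
            self.lo = mid + 1
        elif response == "less":     # target is weaker than the pivot
            self.hi = mid
        elif response == "equally":  # this attribute needs no more questions
            self.done = True
        else:
            raise ValueError(f"unknown response: {response}")


tree = AttributePivotTree({"a": 0.1, "b": 0.5, "c": 0.9, "d": 0.3, "e": 0.7})
print(tree.pivot())  # median image by attribute strength
```

Because each attribute keeps only its current pivot, a question-selection step built on top of this structure would need to examine one candidate image per attribute per round.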

Notably, whereas prior information-gain methods would 
require a naive scan through all database images for each 
iteration, the proposed attribute search trees allow us to limit 
the scan to just one image per attribute. Thus, our method is 
efficient both for the system (which analyzes a small number 
of candidates per iteration) and the user (who locates his 
content via a small number of well-chosen interactions). 


Our main contribution is to widen human-machine communication for
interactive image search by allowing users to communicate their
preferences precisely and efficiently through visual comparisons. We
demonstrate the two versions of WhittleSearch applied to several realistic
search tasks for shoes, people, and scenes. We compare our relative
attribute feedback against traditional binary relevance feedback, and we
show that it refines search results more effectively, often with less
total user interaction. We also present an approach which unifies the
complementary strengths of relative attribute and binary feedback,
allowing feedback of both types. We quantify the advantages of the active
selection of reference images over conventional active methods and a
simpler binary search tree baseline that lacks our information gain
prediction model. The results strongly support our pivot-based approach as
an efficient means to guide user feedback.


2 Related Work 


2.1 Interactive feedback in image search 

Relevance feedback has long been used to improve interactive image search
(Kurita and Kato 1993; Cox et al 2000; Rui et al 1998; Tieu and Viola
2000; Ferecatu and Geman 2007; Zhou and Huang 2003). The main idea is to
tailor the system’s ranking function to the current user, based on his
(usually iterative) feedback on the relevance of selected exemplar images.
This injects subjectivity into the model, implicitly guiding the search
engine to pay attention to certain low-level visual cues more than others.

In a binary relevance feedback model, the user identifies a set of
relevant images and a set of irrelevant images among the current reference
set. The user can also identify which images are more relevant than others
(Ferecatu and Geman 2007). While this is a relative comparison, just like
in other binary relevance feedback methods, the system is not told in what
way image X is more relevant than image Y. Image search results are
produced by ranking all database images using a classifier (or some other
statistical model), and the binary feedback supplies additional positive
and negative training examples to enhance that classifier.

Like existing interactive methods, our approach aims to elicit a specific
user’s target visual concept. However, while prior work restricts input to
the form “A is relevant, B is not” or, as suggested by Ferecatu and Geman
(2007), “C is more relevant than D”, our approach allows users to comment
precisely on what is missing from the current set of results. We show that
this richer form of feedback can lead to more effective refinement. Being
able to pinpoint how one image is more relevant than another (via
attributes) is the key contribution of our approach.

In practice, the images displayed to the user for feedback are usually
those ranked best by the system’s current relevance model. However, if a
user is cooperative, it can be more valuable to present a mix of probable
relevant and irrelevant examples for feedback. If feedback is binary, with
the user labeling examples as relevant (positive) or irrelevant
(negative), the selection can naturally be cast as an active learning
problem: the best examples to show are those that the relevance classifier
is most uncertain about (MacArthur et al 2000; Tong and Chang 2001; Li et
al 2001; Zhou and Huang 2003). Since focusing only on uncertain examples
may ignore parts of the feature space, an alternative strategy is to
display images representative of clusters in the database (Ferecatu and
Geman 2007).

Notably, prior efforts to display the exemplar image set that minimizes
uncertainty were forced to resort to sampling or clustering heuristics due
to the combinatorial optimization problem inherent when categorical
feedback is assumed, e.g., (Cox et al 2000; Ferecatu and Geman 2007). In
contrast, we show that eliciting comparative feedback on ordinal visual
attributes naturally leads to an efficient sequential selection strategy,
where each comparison is guaranteed to decrease the predicted relevance of
half of the unexplored database images.


2.2 Active testing and “20 questions” labeling

Whereas we are interested in actively eliciting user feedback during
search, active methods are also relevant for choosing a series of useful
“tests” (e.g., features to extract) or label requests (“does the bird have
a yellow beak?”) for recognition tasks (Geman and Jedynak 1998; Sznitman
and Jedynak 2010; Vijayanarasimhan and Kapoor 2010; Branson et al 2010).
In the case where a human answers the tests, attributes are well-suited to
query for intermediate labels that will lead to the right high-level
label, as demonstrated for bird labeling tasks (Branson et al 2010). Under
certain scenarios, a globally optimal classification tree can be devised,
so that an image is efficiently classified via a series of binary tests
(Geman and Jedynak 1998). Object localization problems also permit
sequential search strategies that intelligently gather evidence within the
image (Sznitman and Jedynak 2010; Vijayanarasimhan and Kapoor 2010). A
recent approach to categorization uses a human in the loop to provide
responses to actively chosen similarity comparisons (Wah et al 2014).
While this work employs relative comparisons, the problem setting is
different from the one considered here. That work performs categorization
of an image provided to the system, not retrieval of images that match a
user’s mental model.

Our Active WhittleSearch idea shares the spirit of rapidly reducing
uncertainty through a sequence of useful questions. However, our aim is
distinct. Active testing entails selecting queries to classify a single
novel image efficiently, i.e., reduce uncertainty over class labels for
that image, whereas we select queries to efficiently find a target in a
collection of images, i.e., reduce relevance uncertainty for all database
images. Moreover, our approach solicits visual comparisons, which are key
to eliminating irrelevant content in search, whereas prior work solicits
traditional image labels.


2.3 Attributes for image search 

Visual attributes are semantic properties of objects (e.g., fuzzy,
plastic) that serve as a middle ground between low-level features (e.g.,
color, texture) and high-level categories. When used in image search, the
idea is to learn classifiers to predict the presence of various high-level
semantic concepts from a lexicon (such as objects, locations, activity
types, or properties) and then perform retrieval in the space of those
predicted concepts. Human-nameable semantic concepts or attributes are
often used in the multimedia community to build intermediate
representations for image retrieval (Smith et al 2003; Rasiwasia et al
2007; Naphade et al 2006; Zavesky and Chang 2008; Douze et al 2011; Wang
et al 2011; Scheirer et al 2012). They are especially valuable since they
permit content-based keyword queries (Kumar et al 2008; Siddiquie et al
2011; Scheirer et al 2012; Rastegari et al 2013). While originally treated
as categorical (“is smiling” vs. “is not smiling”), attributes can more
generally be modeled as continuous or relative properties (“is smiling
more than X”) (Parikh and Grauman 2011b). While prior work demonstrates
that attributes can provide a richer representation than raw low-level
image features for image search, no previous work considers attributes as
a handle for user feedback, as we propose. In addition, we generalize the
class-based training procedure used in (Parikh and Grauman 2011b) to learn
relative attributes, instead exploiting human-generated relative
comparisons between image exemplars.

This manuscript unifies and expands the work we initially presented in
(Kovashka et al 2012) and (Kovashka and Grauman 2013b), where we first
proposed to use relative attributes as a feedback mechanism for image
search. In this manuscript, we bring together the core approach of those
two papers. We analyze and discuss the advantages and disadvantages of the
two forms of feedback, i.e., user-initiated free-form feedback and
system-initiated active selection. We perform new experimental comparisons
of the two versions of our method and examine when one is better than the
other. Finally, we introduce several new qualitative results.


2.4 Attributes for recognition 

Apart from image search, attributes have also gathered in[...] ([...]
2009; Farhadi et al 2009; Kumar et al 2009; Wang and Mori 2010; Branson et
al 2010; Patterson et al 2014; Wah [...]).

Since attributes are often shared among object categories (e.g., made of
wood, plastic, has wheels), they are amenable to a number of interesting
tasks, such as zero-shot learning from category descriptions (Lampert et
al 2009; Parikh and Grauman 2011b; Patterson et al 2014; Jayaraman et al
2014), describing unfamiliar or anomalous objects (Farhadi et al 2009;
Saleh et al 2013), or categorizing with a 20-questions game (Branson et al
2010). We explore relative attributes in the distinct context of feedback
for image search.

Other work investigates training object recognition classifiers with
actively selected attribute labels. By modeling object-attribute (Kovashka
et al 2011; Parkash and Parikh 2012; Biswas and Parikh 2013) or
attribute-attribute relationships (Zhang and Chen 2002; Mensink et al
2011), one can request the most useful labels to refine the classifiers or
propagate labels. Our goal is quite different: we do active exemplar
selection for image search, not classification, and our approach requests
visual comparisons, not attribute labels.


3 Approach 

Our approach allows a user to iteratively refine the search using feedback
on attributes. The user has some target image in mind: the imagined visual
content the user wants to locate in the database. The target could be a
literal image he/she has seen before, or simply a coarse mental model of
the content of interest. The user initializes the search with some
keywords, either the name of the general class of interest (“shoes”) or
some multi-attribute query (“black high-heeled shoes”), and our system’s
job is to help refine from there. If no such initialization is possible,
we simply begin with a random set of images for feedback. The top-ranked
images are then displayed to the user, and the feedback-refinement loop
begins.

Each iteration of the loop consists of the following: (a) a choice on the
part of the system regarding which reference image(s) to display to the
user for feedback; (b) a choice on the part of the user regarding which
reference image(s) to comment on and/or a decision about the relationship
between the user’s target and the reference image(s); and (c) an update of
the system’s notion of relevance, and thus the ranking of all images in
the database.

Throughout, let $\mathcal{P} = \{I_1, \dots, I_N\}$ refer to the pool of
$N$ database images that are ranked by the system using its current
scoring function $S_t : I_i \rightarrow \mathbb{R}$, where $t$ denotes the
iteration of refinement. The scoring function is trained using all
accumulated feedback from iterations $1, \dots, t-1$, and it supplies an
ordering (possibly partial) on the images in $\mathcal{P}$. At each
iteration, the top $K < N$ ranked images
$\mathcal{T}_t = \{I_{t_1}, \dots, I_{t_K}\} \subseteq \mathcal{P}$ are
displayed to the user, where
$S_t(I_{t_1}) \geq S_t(I_{t_2}) \geq \dots \geq S_t(I_{t_K})$. A user then
gives feedback of his choosing on any or all of the $K$ refined results in
$\mathcal{T}_t$ (in the user-initiated WhittleSearch variant), or else he
gives feedback specifically requested by the system on a particular image
not necessarily among those in $\mathcal{T}_t$ (in the system-initiated
WhittleSearch variant).
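
As a small illustration of the notation, the displayed set of top-ranked images can be obtained by sorting the pool under the current scoring function. The function name `top_k` and the toy scores below are hypothetical; ties are broken by pool order, which is one arbitrary way to handle a partial ordering.

```python
def top_k(pool, score, k):
    """Return the k highest-scoring images, best first.

    Python's sort is stable, so images with equal scores keep pool order.
    """
    return sorted(pool, key=score, reverse=True)[:k]


# Toy scores standing in for the scoring function at one iteration.
scores = {"I1": 0.2, "I2": 0.9, "I3": 0.5, "I4": 0.9}
print(top_k(list(scores), scores.get, 3))
```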

In the following, we first discuss how to learn the relative strength of
an attribute in an image (Section 3.1). Then we introduce the proposed new
mode of relative attribute feedback and explain how the image search
system uses this feedback to update its notion of relevance (Section 3.2).
We then extend the idea to accommodate both our new relative attribute
feedback and traditional binary feedback in a hybrid approach (Section
3.3). Finally, we propose an approach to relegate to the system the choice
of the reference images for feedback, and explain how to select the
optimal reference image in each round of interaction (Section 3.4).


3.1 Learning to Predict Relative Attributes 


Suppose we have a vocabulary of $M$ attributes $\{a_m\}_{m=1}^{M}$, which
may be generic or domain-specific for the image search problem of
interest. For example, a domain-specific vocabulary for shoe shopping
could contain attributes such as shininess, heel height, colorfulness,
etc., whereas for scene descriptions it could contain attributes like
openness, naturalness, depth. While we assume this vocabulary is given,
recent work suggests it may also be discovered automatically or
semi-automatically (Berg et al 2010; Parikh and Grauman 2011a; Maji 2012;
Patterson et al 2014).


Typically semantic visual attributes are learned as categories: a given
image either exhibits the concept or it does not, and so a classification
approach to predict attribute presence is sufficient (Rasiwasia et al
2007; Naphade et al 2006; Zavesky and Chang 2008; Lampert et al 2009;
Farhadi et al 2009; Kumar et al 2009; Wang and Mori 2010; Douze et al
2011). In contrast, to express feedback in the form sketched above, we
require relative attribute models that can predict the degree to which an
attribute is present. Therefore, we first learn a ranking function for
each attribute in the given vocabulary. Note that one might informally
treat classifier outputs as “strengths”, yet doing so is inconsistent with
a training procedure that actually targets hard categorical labels.
Results in (Parikh and Grauman 2011b) confirm that simply treating a
binary classifier output value as the strength of presence is inferior in
practice compared to training ranking functions.

Is Shoe 1 more or less feminine than Shoe 2?

o Shoe 1 is more feminine than Shoe 2.
o Shoe 1 is less feminine than Shoe 2.
o Shoes 1 and 2 are equally feminine.

How obvious was the previous answer?

o Very obvious
o Somewhat obvious
o Subtle, not obvious

Fig. 3 Interface for image-level relative attribute annotations.

For each attribute $a_m$, we obtain supervision on a set of image pairs
$(i, j)$ in a training set $\mathcal{X}$. We ask human annotators to judge
whether that attribute has stronger presence in image $i$ or $j$, or if it
is equally strong in both. Such judgments can be subtle, so on each pair
we collect up to five redundant responses from multiple annotators on
Amazon Mechanical Turk (MTurk); see Figure 3. To distill reliable relative
constraints for training, we use only those for which most labelers agree.
This yields a set of ordered image pairs $O_m = \{(i, j)\}$ and a set of
un-ordered pairs $E_m = \{(i, j)\}$ such that
$(i, j) \in O_m \implies i \succ j$, i.e., image $i$ has stronger presence
of attribute $a_m$ than image $j$, and $(i, j) \in E_m \implies i \sim j$,
i.e., images $i$ and $j$ have equivalent strengths of $a_m$.
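
The distillation step can be sketched as a majority vote over the redundant responses for each pair. This is an illustrative reading rather than the authors' exact procedure: the text says only that pairs on which "most labelers agree" are kept, so the strict-majority threshold `min_agreement`, the vote labels, and the function name are assumptions.

```python
from collections import Counter


def distill_constraints(responses, min_agreement=0.5):
    """Turn redundant pairwise votes into ordered and un-ordered pair sets.

    responses: dict mapping (i, j) -> list of votes, each one of
      "more"  (image i shows the attribute more strongly than j),
      "less"  (image i shows it less strongly than j), or
      "equal".
    A pair is kept only if its most common vote exceeds min_agreement
    (a hypothetical threshold; 0.5 means a strict majority).
    """
    ordered, unordered = [], []
    for (i, j), votes in responses.items():
        if not votes:
            continue
        label, count = Counter(votes).most_common(1)[0]
        if count / len(votes) <= min_agreement:
            continue  # annotators disagree too much: drop the pair
        if label == "more":
            ordered.append((i, j))
        elif label == "less":
            ordered.append((j, i))  # flip so the stronger image comes first
        else:
            unordered.append((i, j))
    return ordered, unordered
```

With this convention every ordered pair lists the image with the stronger attribute first, matching the meaning of the ordered set in the text.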

We would like to emphasize the design for constraint collection: rather
than ask annotators to give an absolute score reflecting how much the
attribute $m$ is present, we instead ask them to make comparative
judgements on two exemplars at a time. This is more natural for an
individual annotator, and it also permits seamless integration of the
supervision from many annotators, each of whom may have a different
internal “calibration” for the attribute strengths.


Next, to learn an attribute’s ranking function, we employ the large-margin
formulation of Joachims (Joachims 2002), which was originally shown for
ranking web pages based on clickthrough data, and used for relative
attribute learning (Parikh and Grauman 2011b). Suppose each image $I_i$ is
represented in $\mathbb{R}^d$ by a feature vector $x_i$ (we use color and
GIST; more details below). We aim to learn $M$ ranking functions, one per
attribute:

$$a_m(x_i) = w_m^T x_i \tag{1}$$

for $m = 1, \dots, M$, such that the maximum number of the following
constraints is satisfied:

$$\forall (i, j) \in O_m : \; w_m^T x_i > w_m^T x_j. \tag{2}$$

Joachims’ algorithm approximates this NP-hard problem by introducing (1) a
regularization term that prefers a wide margin between the ranks assigned
to the closest pair of training instances, and (2) slack variables
$\xi_{ij}$ on the constraints, yielding the following objective (Joachims
2002):

$$\text{minimize} \quad \frac{1}{2} \|w_m\|_2^2 + C \sum \xi_{ij}^2 \tag{3}$$

$$\text{s.t.} \quad w_m^T x_i \geq w_m^T x_j + 1 - \xi_{ij}, \quad \forall (i, j) \in O_m,$$

where $C$ is a constant penalty. The objective is reminiscent of standard
SVM training (and is solvable using similar decomposition algorithms),
except the linear constraints enforce relative orderings rather than
labels. While shown here in the linear form, the method is also
kernelizable. We use Joachims’ SVMRank code (Joachims 2006).^1
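
To make the pairwise formulation concrete, the following is a minimal sketch of learning one linear ranking function from ordered pairs. It is not the SVMRank solver used in the paper. It assumes a soft-margin objective with squared slack, eliminates each slack variable as max(0, 1 - w·(x_i - x_j)), and runs plain gradient descent; the function names, learning rate, and epoch count are illustrative choices only.

```python
def train_ranker(features, ordered_pairs, C=1.0, lr=0.05, epochs=200):
    """Fit w so that w.x_i exceeds w.x_j by a margin for each ordered pair.

    features: dict mapping image id -> list of floats (the descriptor).
    ordered_pairs: list of (i, j), meaning image i has MORE of the
      attribute than image j.
    Minimizes 0.5*||w||^2 + C * sum(max(0, 1 - w.(x_i - x_j))^2).
    """
    dim = len(next(iter(features.values())))
    w = [0.0] * dim
    # Each constraint depends only on the difference vector x_i - x_j.
    diffs = [[a - b for a, b in zip(features[i], features[j])]
             for i, j in ordered_pairs]
    for _ in range(epochs):
        grad = list(w)  # gradient of the regularizer 0.5*||w||^2
        for d in diffs:
            slack = 1.0 - sum(wk * dk for wk, dk in zip(w, d))
            if slack > 0:  # margin violated: add squared-hinge gradient
                for k in range(dim):
                    grad[k] -= 2.0 * C * slack * d[k]
        w = [wk - lr * gk for wk, gk in zip(w, grad)]
    return w


def attribute_strength(w, x):
    """Predicted relative attribute strength: the linear score w.x."""
    return sum(wk * xk for wk, xk in zip(w, x))
```

Only the ordering of the resulting scores is meaningful; their absolute values carry no calibrated interpretation.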

Having trained $M$ such functions, we are then equipped to predict the
extent to which each attribute is present in any novel image, by applying
the learned functions $a_1, \dots, a_M$ to its image descriptor $x$. This
training is a one-time process done before any search query or feedback is
issued. Furthermore, the data $\mathcal{X}$ used for training attribute
rankers is not to be confused with our database pool $\mathcal{P}$; the
two may be disjoint sets of images.

Whereas Parikh and Grauman (2011b) propose generating supervision for
relative attributes from top-down category comparisons (“person X is
(always) more smiley than person Y”), our approach extends the learning
process to incorporate image-level relative comparisons (“image A exhibits
more smiling than image B”). While training from category-level
comparisons is clearly more expedient, we find that image-level
supervision is important in order to reliably capture those attributes
that do not closely follow category boundaries. The smiling attribute is a
good example of this contrast, since a given person (the category) need
not be smiling to an equal degree in each of his/her photos. In fact, our
user studies on MTurk show that category-level relationships violate 23%
of the image-level relationships specified by human subjects for the
smiling attribute. In the results section, we detail related human studies
analyzing the benefits of image-level comparisons.
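
A statistic of this kind could be computed as sketched below. The source does not spell out how a violation is counted, so this sketch assumes one reading: an image-level pair is violated when the category-level ordering ranks the two images' categories in the opposite direction. The function name and the data layout are hypothetical.

```python
def category_violation_rate(image_pairs, category_of, category_rank):
    """Fraction of image-level orderings contradicted by category order.

    image_pairs: list of (i, j) where annotators judged image i to show
      the attribute MORE strongly than image j.
    category_of: dict mapping image id -> category (e.g., a person).
    category_rank: dict mapping category -> rank, higher = stronger.
    Pairs whose categories tie are not counted as violations here.
    """
    violated = sum(
        1 for i, j in image_pairs
        if category_rank[category_of[i]] < category_rank[category_of[j]]
    )
    return violated / len(image_pairs)
```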


^1 Note that one can also use the equality constraints in $E_m$ for
training these ranking functions, as in Parikh and Grauman (2011b). In our
approach, we use these constraints to compute parameters for scoring
relevance, in Section 3.2.

Fig. 4 Sketch of WhittleSearch relevance computation. This toy example
illustrates the intersection of relative constraints with M = 2
attributes. The images are plotted on the axes for both attributes. The
space of images that satisfy each constraint is marked in a different
color. The region satisfying all constraints is marked with a black dashed
line. In this case, there is only one image in it (outlined in black).
Best viewed in color.


3.2 Relative Attribute Feedback 


Next we define the basic WhittleSearch framework. With the ranking
functions learned above, we can now map any image from the database
$\mathcal{P}$ into an $M$-dimensional space, where each dimension
corresponds to the relative rank prediction for one attribute. It is in
this feature space we propose to handle query refinement from a user’s
feedback.

A user of the system has a mental model of the target
visual content he seeks. To refine the current search results,
he surveys the K top-ranked images in $\mathcal{T}_t$, and uses some
of them as reference images with which to better express
his envisioned optimal result. These constraints are of the
form “What I want is more/less/similarly A than image $I_{t_f}$”,
where A is an attribute name, and $I_{t_f}$ is an image in $\mathcal{T}_t$ (the
subscript $t_f$ denotes it is a reference image for feedback at
iteration t). For now, suppose these relative constraints are
given for some combination of image(s) and attribute(s) of
the user’s choosing. Later, in Section 3.4, we will consider
how instead the system can guide the choice of the image
and attribute for feedback so as to most quickly reduce its
uncertainty about what the user wants.

The WhittleSearch system accumulates this feedback 
from the user during each round of interaction, each time up¬ 
dating the relevance it associates with each database image. 
Intuitively, the user’s statements about relative preferences 
serve to carve out a relevant region of the M-dimensional 
attribute feature space, whittling away images not meeting 
the user’s requirements. See Figure 4. Accordingly, we next
define a relevance function that predicts the extent to which
a database image matches the user’s target. It is a probabilistic
model of relevance to account for the fact that predicted
attribute values can deviate from true perceived attribute
strengths to some extent.

^ We do, however, assume that all users would agree on the true
attribute strength in a given image. See Kovashka and Grauman (2013a)
for an approach to model the user-specific perception of an attribute.


Let $y_i \in \{1, 0\}$ denote the binary label for image $I_i$,
which reflects whether it is relevant to the user (matches
his target), or not. Let $\mathcal{F} = \{F_t\}_{t=1}^{T}$ denote the set
of comparative constraints accumulated in the T rounds of
feedback so far. The t-th item in $\mathcal{F}$, $F_t$, consists of a
reference image $I_{t_f}$ for attribute m, and a user response
$r \in$ {“more”, “less”, “equally”}. The final output of our search
system will be a sorting of the database images in V according
to their probability of relevance, given the image content
and all user feedback: $P(y_i = 1 \mid I_i, \mathcal{F})$, for $i = 1, \dots, N$.

Let $S_{t,i} \in \{0, 1\}$ be a binary random variable representing
whether image $I_i$ satisfies the t-th feedback constraint.
For example, if the user’s t-th comparison on attribute m
yields response r = “more”, then $S_{t,i} = 1$ if the database
image $I_i$ has attribute m more than the corresponding reference
image $I_{t_f}$. We assume that the probability of an image
satisfying a given constraint is independent of it satisfying
another given constraint. The probability that database image
$I_i$ is relevant is the probability that it satisfies all T feedback
comparisons in $\mathcal{F}$:

$$P(y_i = 1 \mid I_i, \mathcal{F}) = \prod_{t=1}^{T} P(S_{t,i} = 1 \mid I_i, F_t). \tag{4}$$

For numerical stability, we replace the product above with a
sum of log probabilities:

$$\log P(y_i = 1 \mid I_i, \mathcal{F}) = \sum_{t=1}^{T} \log P(S_{t,i} = 1 \mid I_i, F_t). \tag{5}$$
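
The product in Eqn. 4 and its log-sum form in Eqn. 5 can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the small floor `eps`, and the toy probabilities are assumptions made only for the example.

```python
import math

def log_relevance(constraint_probs, eps=1e-12):
    """Sum of log P(S_{t,i} = 1 | I_i, F_t) over the T constraints (Eqn. 5).

    eps is a hypothetical floor that avoids log(0) for a violated constraint.
    """
    return sum(math.log(max(p, eps)) for p in constraint_probs)

def rank_by_relevance(per_image_probs):
    """Return image indices sorted from most to least relevant."""
    scores = [log_relevance(probs) for probs in per_image_probs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Three hypothetical images, each scored against T = 2 feedback constraints.
probs = [[0.9, 0.8], [0.9, 0.1], [0.5, 0.5]]
ranking = rank_by_relevance(probs)
```

Because the logarithm is monotonic, sorting by the summed log probabilities gives the same order as sorting by the product in Eqn. 4.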

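The probability that an individual constraint is satisfied
given that the user’s response was r for reference image $I_{t_f}$ is:

$$P(S_{t,i} = 1 \mid I_i, F_t) = \begin{cases} P(A_m(I_i) > A_m(I_{t_f})) & \text{if } r = \text{“more”} \\ P(A_m(I_i) < A_m(I_{t_f})) & \text{if } r = \text{“less”} \\ P(A_m(I_i) = A_m(I_{t_f})) & \text{if } r = \text{“equally”,} \end{cases}$$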


where $A_m(I)$ denotes the true strength of attribute m in image
I. Note that we do not observe these true attribute values
directly; rather, what we observe are the system’s predicted
attribute values $a_m(I_i)$, which are necessarily imperfect.
While the predicted attribute ranks are a function of the
true latent attribute strengths $A_m(I_i)$, they need not agree
exactly. Therefore, we estimate the probabilities required
above by mapping the attribute predictions $a_m(\cdot)$ to
probabilistic outputs. We adapt Platt’s method (Platt 1999) to
the paired classification problem implicit in the large-margin
ranking objective of Section 3.1. Specifically, this yields:

$$P(A_m(I_i) > A_m(I_{t_f})) = \frac{1}{1 + \exp(\alpha_m (a_m(I_i) - a_m(I_{t_f})) + \beta_m)},$$

$$P(A_m(I_i) < A_m(I_{t_f})) = 1 - P(A_m(I_i) > A_m(I_{t_f})), \text{ and}$$

$$P(A_m(I_i) = A_m(I_{t_f})) = \frac{1}{1 + \exp(\gamma_m |a_m(I_i) - a_m(I_{t_f})| + \delta_m)}. \tag{6}$$


The sigmoid parameters are learned using the sets $O_m$ and
$E_m$ from above. In particular, to learn $\alpha_m$ and $\beta_m$, we use
pairs with “more” judgments from $O_m$ as positive paired
instances, and “less” judgments as negative instances. For
$\gamma_m$ and $\delta_m$, we use “equally” pairs from $E_m$ as positive
labels, and both “more” and “less” responses from $O_m$ as
negative instances. We normalize these values so the three
probabilities (“more”/“less”/“equally”) sum to 1.
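
The mapping from predicted attribute values to the three response probabilities can be sketched as follows. The parameter values are made up for illustration (in the paper they are fit per attribute from $O_m$ and $E_m$), and dividing by the total is only one plausible way to make the three values sum to 1; the text does not specify the normalization.

```python
import math

def platt_sigmoid(z):
    """Platt-style link 1 / (1 + exp(z)); a negative slope makes it increasing."""
    return 1.0 / (1.0 + math.exp(z))

def response_probs(a_i, a_ref, alpha, beta, gamma, delta):
    """Probabilities that image i has the attribute more/less/equally than the reference.

    a_i, a_ref: predicted attribute values; alpha..delta: hypothetical sigmoid parameters.
    """
    diff = a_i - a_ref
    p_more = platt_sigmoid(alpha * diff + beta)
    p_less = 1.0 - p_more
    p_equal = platt_sigmoid(gamma * abs(diff) + delta)
    total = p_more + p_less + p_equal  # assumed normalization: divide by the sum
    return {"more": p_more / total, "less": p_less / total, "equally": p_equal / total}

# Hypothetical parameters: alpha < 0 so a larger gap favors "more",
# gamma > 0 so a larger absolute gap disfavors "equally".
p = response_probs(0.9, 0.4, alpha=-4.0, beta=0.0, gamma=4.0, delta=-1.0)
q = response_probs(0.4, 0.4, alpha=-4.0, beta=0.0, gamma=4.0, delta=-1.0)
```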

The relevance function defined above takes the user’s input
at face value. Namely, if the user does not comment on
an attribute within the image, we assume we have no information
about that attribute. In other work, we explore how
this assumption can be relaxed to learn the implicit cues a
user reveals in his/her attribute feedback (Parikh and Grauman
2013). For example, if a user elects to tell the system
that his target is less shiny than some reference image X,
and the reference image set the user saw contained another
image Y that is less shiny than X, then the system could
infer that the target is not less shiny than Y; otherwise,
he would have provided that tighter constraint (Parikh and
Grauman 2013).

We stress that the proposed form of relative attribute
feedback refines the search in ways that a straightforward
multi-attribute query (e.g., as developed by Kumar et al. (2008),
Siddiquie et al. (2011), and Scheirer et al. (2012)) cannot.
That is, if a user were to simply state the attribute labels
of interest (“show me black shoes that are shiny and
high-heeled”), one can easily retrieve the images whose attribute
predictions meet those criteria. However, since the user’s
description is in absolute terms, it cannot evolve based on
the retrieved images. In contrast, with access to relative
attributes as a mode of communication, for every new set of
reference images returned by the system, a WhittleSearch
user can further refine his description. In addition, when a
user states that a reference image has the attribute “equally”
to his target, he reveals more precise information than
traditional binary relevance feedback. In the former, we learn
about the reference image’s quality in the context of an
individual attribute; in the latter, one learns only the coarse
information that the image seems good or bad, across all
attributes.


3.3 Hybrid Feedback Approach

So far, we have considered relative attribute feedback in
isolation and discussed its advantages over traditional binary
relevance feedback. However, binary relevance feedback and
relative attribute feedback can have complementary strengths:
when reference images are nearly on target (or completely
wrong in all aspects), the user may be best served by providing
a simple binary relevance label. Meanwhile, when a
reference image is lacking only in certain describable properties,
he may be better served by the relative attribute feedback.
Thus, it is natural to combine the two modalities,
allowing a mix of feedback types at any iteration.

In a binary relevance feedback model, the user identifies
a set of relevant images $\mathcal{R}$ and a set of irrelevant images
$\bar{\mathcal{R}}$ among the current reference set $\mathcal{T}_t$. In this case, the
relevance scoring function is a classifier (or some other statistical
model), and the binary feedback essentially supplies additional
positive and negative training examples to enhance
that classifier. That is, the scoring function at iteration $t + 1$
is trained with the data that trained the model at iteration t
plus the images in $\mathcal{R}$ labeled as positive instances and the
images in $\bar{\mathcal{R}}$ labeled as negative instances.

We can augment the WhittleSearch system with binary
feedback to define a learned hybrid scoring function. The
basic idea is to learn a ranking function that unifies both
relative attribute and binary feedback. Let $C_k \subset V$ denote
the subset of database images satisfying k of the relative
attribute feedback constraints, for $k = 0, \dots, F$. We define a
set of ordered image pairs

$$O_s = \{ \{\mathcal{R} \times \bar{\mathcal{R}}\} \cup \{C_F \times C_{F-1}\} \cup \dots \cup \{C_1 \times C_0\} \}, \tag{7}$$

where $\times$ denotes the Cartesian product. This set $O_s$ reflects
all the desired ranking preferences: that relevant images be
ranked higher than irrelevant ones, and that images satisfying
more relative attribute preferences be ranked higher than
those satisfying fewer. Note that the subscript s in $O_s$
distinguishes the set from those indexed by m above, which were
used to train relative attribute ranking functions in Section 3.1.

Using training constraints $O_s$ we learn a function that
predicts relative image relevance for the current user with
the large-margin ranking objective of Section 3.1. The result is a
parameter vector $w_s$ that serves as the hybrid scoring function.
Since there are many more pairs in $O_s$ that come from relative
attribute feedback than from binary relevance feedback,
we set the penalty on the binary feedback pairs to be
inversely proportional to the fraction of such pairs in the set
$O_s$.


3.4 Active WhittleSearch with Attribute Pivots


Thus far, we have assumed that the user will freely select 
the feedback statements he wishes to give the system from 
among the top-ranked images. This is the first user-initiated 
variant of WhittleSearch, and it is most suited when a user 
wishes to browse at the same time he refines his own mental
model of the target. However, as argued above, when a
user has a precise target in mind, it can be more beneficial to
leave the choice of reference images for feedback to the im¬ 
age search system. Thus, we next present an active variant 
of WhittleSearch. 

In Active WhittleSearch, the interaction mode involves 
a series of multiple-choice questions that the human user 
needs to answer, of the type: “Is the image you are looking 
for more, less, (or equally) A than image I?”, where A is
a semantic attribute and I is an exemplar from the database
being searched. Our goal is to generate the series of such 
questions that will most efficiently narrow down the rele¬ 
vant images in the database, so that the user finds his target 
in few iterations. To this end, at each iteration we will ac¬ 
tively select a comparison for the user to provide, that is, 
the (A, I) pair that yields the expected maximal informa¬ 
tion gain. Rather than exhaustively search all database im¬ 
ages as potential exemplars, however, we consider only a 
small number of pivot exemplars—the internal nodes of bi¬ 
nary search trees constructed for each attribute. The output 
of the system is the list of database images, sorted by their 
predicted relevance. 

As above, Active WhittleSearch also relies on predicted
attribute values (Section 3.1) and a manner of updating the
system’s notion of relevance after each feedback statement it
receives from the user (Section 3.2). It also relies on binary
search trees, whose construction we explain next (Section
3.4.1). Then, we introduce our active selection approach to
determine which comparison should be requested next (Section
3.4.2), using the probabilistic model of image relevance
defined in Section 3.2 above.


3.4.1 Attribute Binary Search Trees

For each attribute $m = 1, \dots, M$, we construct a binary
search tree. The tree recursively partitions all the database
images into two balanced sets, where the key at a given
node is the median relative attribute value occurring within
the set of images passed to that node. To build the m-th
attribute tree, we start at the root with all database images,
sort them by their attribute values $a_m(I_1), \dots, a_m(I_N)$, and
identify the median value. Let $I_p$ denote the “pivot” image,
that is, the one that has the median attribute strength. Those
images exhibiting the attribute less than $I_p$, i.e., all $I_i$ such
that $a_m(I_i) < a_m(I_p)$, are passed to the left child, while
those exhibiting the attribute more, i.e., $a_m(I_i) > a_m(I_p)$,
are passed to the right child. Then the splitting repeats
recursively, each time storing the next pivot image and its relative
attribute value at the appropriate node.

Note that both the relative attribute ranker training and 
the search tree construction are offline procedures; they are 
performed once, before handling any user queries. 
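
The offline tree construction described above can be sketched in a few lines. This is an illustrative reading rather than the authors' code: the dictionary node layout, the image identifiers, and the handling of tied attribute values (split by sorted position) are assumptions.

```python
def build_pivot_tree(items):
    """Build one attribute's binary search tree of median pivots.

    items: list of (predicted_attribute_value, image_id) pairs in any order.
    Returns a nested dict, or None for an empty set of images.
    """
    if not items:
        return None
    ordered = sorted(items)            # sort images by predicted attribute value
    mid = len(ordered) // 2            # position of the median image (the pivot)
    value, image_id = ordered[mid]
    return {
        "pivot": image_id,
        "value": value,
        "left": build_pivot_tree(ordered[:mid]),        # images with less of the attribute
        "right": build_pivot_tree(ordered[mid + 1:]),   # images with more of the attribute
    }

# Seven hypothetical images with predicted values for a single attribute.
items = [(0.1, "a"), (0.4, "b"), (0.2, "c"), (0.9, "d"),
         (0.5, "e"), (0.7, "f"), (0.3, "g")]
tree = build_pivot_tree(items)
```

One such tree would be built per attribute, so a vocabulary of M attributes yields M trees over the same database.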

Already, one could imagine a search procedure that walks
a user through one such attribute tree, at each successively
deeper level requesting a comparison to the pivot, and then
eliminating the appropriate portion of the database depending
on whether the user says “more” or “less”. However,
there are two problems with such a simple approach. First,
we cannot assume that the attribute predictions are identical
to the attribute strengths a user will perceive; thus, a
hard pruning of a full sub-tree is error-prone. Second, it fails
to account for the variable information gain that could be
achieved depending on which attribute is explored at any
given round of feedback. Therefore, we use the probabilistic
representation of whether images satisfy the comparison
constraints, as defined in Section 3.2, and we use the pivots
to limit the pool of candidate images that are evaluated for
their expected information gain, as we will explain next.

3.4.2 Actively Selecting an Informative Comparison 

Our system maintains a set of M current pivot images (one
per attribute tree) at each iteration, which we denote by
$\mathcal{P} = \{I_{p_1}, \dots, I_{p_M}\}$, where $\mathcal{P} \subset V$. The pivots are initially
the root pivot images from each tree. During active selection,
our goal is to identify the pivot in this set that, once
compared by the user to his target, will most reduce the
entropy of the relevance predictions on all database images in
V. Note that selecting a pivot corresponds to selecting both
an image as well as an attribute along which we want it to
be compared. That is, $I_{p_m}$ refers to the pivot for attribute m.

Entropy reduction objective. Given the feedback history $\mathcal{F}$,
we want to predict the information gain across all N database
images for each pivot in the current pivot set. We will request
a comparison for the pivot that most reduces the total relevance
entropy over all images or, equivalently, the pivot that minimizes
the expected entropy when used to augment the current set
of feedback constraints.

The entropy based on the feedback thus far is:

$$H(\mathcal{F}) = -\sum_{i=1}^{N} \sum_{\ell} P(y_i = \ell \mid I_i, \mathcal{F}) \log P(y_i = \ell \mid I_i, \mathcal{F}), \tag{8}$$

where $\ell \in \{0, 1\}$. Let R be a random variable denoting the
user’s response, $R \in$ {“more”, “less”, “equally”}. We select
the next pivot for comparison as:

$$I_p^* = \operatorname*{argmin}_{I_{p_m}} \sum_{r} P(R = r \mid I_{p_m}, \mathcal{F}) \, H(\mathcal{F} \cup (I_{p_m}, r)), \tag{9}$$

where $H(\mathcal{F} \cup (I_{p_m}, r))$ denotes the entropy computed on
the accumulated feedback when it is further augmented with
the hypothetical response r on pivot image $I_{p_m}$, and
$P(R = r \mid I_{p_m}, \mathcal{F})$ is the likelihood of the user giving the
response r. In other words, the most informative pivot, the one the
user should next compare his target image to, is the pivot
that most reduces the expected entropy.

User response likelihood. Optimizing Eqn. 9 requires estimating
the likelihood of each of the three possible user responses
to a question we have not issued yet. We develop
three possible strategies to estimate it. In each case, we use
cues from the available feedback history to form a “proxy”
for the user, essentially borrowing the probability that a new
constraint is satisfied from previously seen feedback.
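
The selection rule in Eqns. 8 and 9 can be sketched as follows, taking the response likelihoods as given inputs. The data layout and names are assumptions for illustration; the relevance update multiplies the current relevance by the new constraint probability, following the product form of Eqn. 4, and P(y_i = 0) is taken to be 1 - P(y_i = 1).

```python
import math

def binary_entropy(p):
    """Entropy of a Bernoulli relevance label with P(y = 1) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def total_entropy(relevance):
    """Eqn. 8: sum of per-image relevance entropies."""
    return sum(binary_entropy(p) for p in relevance)

def expected_entropy(relevance, satisfy, likelihood):
    """Inner sum of Eqn. 9 for one candidate pivot.

    relevance[i]  : current P(y_i = 1 | I_i, F)
    satisfy[r][i] : P(S_{c,i} = 1) if the user answered r for this pivot
    likelihood[r] : estimated P(R = r) for this pivot
    """
    total = 0.0
    for r, p_r in likelihood.items():
        updated = [p * s for p, s in zip(relevance, satisfy[r])]
        total += p_r * total_entropy(updated)
    return total

def select_pivot(relevance, candidates):
    """Return the candidate pivot with the smallest expected entropy."""
    return min(candidates, key=lambda c: expected_entropy(relevance, *candidates[c]))

# Toy setup: four images, two hypothetical pivots named by their attribute.
relevance = [0.5, 0.5, 0.5, 0.5]
candidates = {
    "shiny": ({"more": [1.0, 1.0, 0.0, 0.0], "less": [0.0, 0.0, 1.0, 1.0],
               "equally": [0.0, 0.0, 0.0, 0.0]},
              {"more": 0.5, "less": 0.5, "equally": 0.0}),
    "pointy": ({"more": [1.0, 1.0, 1.0, 1.0], "less": [0.0, 0.0, 0.0, 0.0],
                "equally": [0.0, 0.0, 0.0, 0.0]},
               {"more": 1.0, "less": 0.0, "equally": 0.0}),
}
best = select_pivot(relevance, candidates)
```

In this toy case the "shiny" pivot splits the database in half whichever way the user answers, while the "pointy" pivot leaves every image as uncertain as before, so the former is selected.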

For the first strategy, which we call All Relevant, we
use all relevant database images as the proxy. The assumption
is that the images that are relevant to the user thus far are
(on the whole) more likely to satisfy the user’s next feedback
than those that are irrelevant. This is reminiscent of standard
practice in active classifier training, where posteriors
estimated with the current classifier are used as weights in the
expected entropy reduction of acquiring a new label. Ideally
we would average the $P(S_{c,i} = 1 \mid I_i, F_c)$ values among only
the relevant images, where c indexes the candidate new
feedback for a (yet unknown) user response R. Of course,
we can only predict relevance, so we compute the weighted
probability of each possible response R:


$$P_{\text{all}}(R = r \mid I_{p_m}, \mathcal{F}) \propto \sum_{i=1}^{N} P(y_i = 1 \mid I_i, \mathcal{F}) \, P(S_{c,i} = 1 \mid I_i, F_c), \tag{10}$$

where the all subscript stands for All Relevant.

Fig. 5 The Active WhittleSearch variant requests feedback on images
that elicit the most information, using binary search trees to focus the
active selection. In this sketch, M = 2 attribute trees are shown. Images
with the same color outline are the pairs considered at each round,
and the number in this color marks the image chosen at this round. Red
arrows denote the user’s responses. Here, first the user is asked to
compare his target to the boot pivot (1) in terms of pointiness; then he is
asked to compare it to (2) in terms of shininess, followed by (3) in
terms of pointiness, and so on. Best viewed in color.

The second strategy, which we call Most Relevant,
is similar, but uses only our current best guess for the target
image as the proxy:

$$P_{\text{most}}(R = r \mid I_{p_m}, \mathcal{F}) = P(S_{c,b} = 1 \mid I_b, F_c), \tag{11}$$

where $I_b$ is the database image that maximizes $P(y_i = 1 \mid I_i, \mathcal{F})$,
for $i = 1, \dots, N$.

The third strategy, which we call Similar Question,
examines all previously answered feedback requests, and
copies the answer from the question that is most similar to
the new one. We define question similarity in terms of the
Euclidean distance between the pivot images’ descriptors
plus the similarity of the two attributes involved in either
question. We quantify the latter by the Kendall’s $\tau$ correlation
between the ranks they assign to a set of validation images.
For example, this reflects that feminine and heel height
are more aligned than feminine and grayness. Let $r_k$ denote
the response to the most similar question k found in the history
$\mathcal{F}$ for the new pivot $I_{p_m}$ under consideration. Then we
have:

$$P_{\text{question}}(R = r \mid I_{p_m}, \mathcal{F}) = \begin{cases} 1 & \text{if } r = r_k \\ 0 & \text{otherwise.} \end{cases} \tag{12}$$


We evaluate all three likelihood strategies in the results. 
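
The three likelihood strategies can be sketched side by side. The function names and inputs are hypothetical, the All Relevant weights are normalized to sum to 1 as an assumption, and question similarity is taken as a precomputed score rather than derived from descriptors and Kendall's tau.

```python
RESPONSES = ("more", "less", "equally")

def likelihood_all_relevant(relevance, satisfy):
    """All Relevant: relevance-weighted vote over all images, then normalized.

    relevance[i]: P(y_i = 1); satisfy[r][i]: P(S_{c,i} = 1) under response r.
    """
    weights = {r: sum(p * s for p, s in zip(relevance, satisfy[r])) for r in RESPONSES}
    total = sum(weights.values())
    if total == 0.0:
        return {r: 1.0 / len(RESPONSES) for r in RESPONSES}  # assumed fallback
    return {r: w / total for r, w in weights.items()}

def likelihood_most_relevant(relevance, satisfy):
    """Most Relevant: use only the current best guess for the target."""
    best = max(range(len(relevance)), key=lambda i: relevance[i])
    return {r: satisfy[r][best] for r in RESPONSES}

def likelihood_similar_question(past_responses, similarities):
    """Similar Question: copy the answer of the most similar past question.

    similarities[k] is a precomputed similarity between past question k
    and the new question (larger means more similar).
    """
    k = max(range(len(similarities)), key=lambda j: similarities[j])
    return {r: 1.0 if r == past_responses[k] else 0.0 for r in RESPONSES}

# Toy inputs: three images, one candidate pivot.
relevance = [0.8, 0.1, 0.1]
satisfy = {"more": [0.9, 0.2, 0.2], "less": [0.05, 0.7, 0.7], "equally": [0.05, 0.1, 0.1]}
p_all = likelihood_all_relevant(relevance, satisfy)
p_most = likelihood_most_relevant(relevance, satisfy)
p_question = likelihood_similar_question(["less", "more"], [0.2, 0.9])
```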


Recap of Active WhittleSearch interaction loop. At each iteration,
we present the user with the pivot selected with
Eqn. 9 and request the specified attribute comparison. Then,
we (1) use his response to update $\mathcal{F}$ with that additional
image-attribute-response constraint, and (2) either replace
the pivot for that attribute in the current pivot set with its
appropriate child pivot (i.e., the left or right child in the binary
search tree if the response is “less” or “more”, respectively) or
terminate the exploration of this tree (if the response is “equally”).
Note that this means that the set of pivots consists of pointers
into the binary trees at varying levels. See Figure 5. This
is because our active selection criterion considers which
attribute will most benefit from more refined feedback at any
point in time. In contrast, a simpler solution that alternates
between the attribute trees in sequence need not reduce
uncertainty as efficiently, as we will show in the results.

Finally, the approach iterates until the user is satisfied 
with the top-ranked results, or until all of the attribute trees 
have bottomed out to an “equally” response from the user 
(in which case, our method can gain no further knowledge 
about the target given the available attribute vocabulary). 
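
The pivot bookkeeping in step (2) of the loop can be sketched as below. The nested-dict tree layout and the attribute names are assumptions for illustration; a value of `None` marks a tree whose exploration has ended.

```python
def advance_pivot(node, response):
    """Move one attribute's pivot pointer after the user's response.

    Returns the child node to use next, or None when exploration of this
    tree ends ("equally", or no child left in the chosen direction).
    """
    if node is None or response == "equally":
        return None
    return node["left"] if response == "less" else node["right"]

def leaf(image_id):
    """Helper that builds a node without children."""
    return {"pivot": image_id, "left": None, "right": None}

# One tiny hypothetical tree per attribute.
pivots = {
    "shiny": {"pivot": "b", "left": leaf("a"), "right": leaf("c")},
    "pointy": {"pivot": "y", "left": leaf("x"), "right": leaf("z")},
}

# The user says the target is "more" shiny than pivot "b": descend to the right.
pivots["shiny"] = advance_pivot(pivots["shiny"], "more")
after_first = pivots["shiny"]["pivot"]

# The user says the target is "equally" pointy as pivot "y": stop exploring that tree.
pivots["pointy"] = advance_pivot(pivots["pointy"], "equally")

# Attributes that can still be asked about.
active = [m for m, node in pivots.items() if node is not None]
```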

The cost of our selection method per round of feedback
is $O(MN)$, where M is the size of the attribute vocabulary,
N is the database size, and $M \ll N$. For each of $O(M)$ pivots
which can be used to complement the feedback set, we
need to evaluate expected entropy for all N images. In contrast,
a traditional information gain approach would scan all
database items paired with all attributes, requiring $O(MN^2)$
time. The proposed binary search trees exploit the ordinal
values of relative attributes to make this complexity reduction
possible.
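
As a rough illustration of this gap (a worked example, not a measurement from the paper), take M = 10 attributes and N = 14,658 images, the sizes of the Shoes dataset described in Section 4.1, and count one relevance update per image per candidate, ignoring constant factors such as the three possible responses:

$$MN = 10 \times 14{,}658 = 146{,}580, \qquad MN^2 = 10 \times 14{,}658^2 = 2{,}148{,}569{,}640.$$

Under these assumptions, restricting the candidates to the M current pivots reduces the per-round work by a factor of N, here roughly four orders of magnitude.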

3.5 Discussion 

Having described both WhittleSearch variants, we can now 
compare and contrast them in detail. Recall that feedback 
is user-initiated in the first variant: the system presents the 
user with the top-ranked current results, and the user freely 
chooses those on which he wishes to provide comparative 
feedback. The second variant, Active WhittleSearch, is
system-initiated: the system asks the user for a visual com¬ 
parison between the envisioned target image and an actively 
selected reference image along a specific attribute. Both vari¬ 
ants have potential advantages that are revealed under differ¬ 
ent scenarios. 

Active WhittleSearch makes a choice that is optimal with 
respect to the knowledge that the image search system pos¬ 
sesses. This can be likened to a situation where we rely on 
a student’s own understanding of what he knows in order to 
improve his knowledge. However, unlike WhittleSearch, the 
set of images that is shown to the user for feedback is often 
disjoint from those that are ranked highest by the system. 
Therefore, the user must separately examine the images for 
feedback and the image results. 

In contrast, WhittleSearch gives the human user several 
options about the reference images and attributes on which 
to comment. Therefore, the performance of the system de¬ 
pends both on the choices that the user makes, as well as 
the correctness of the response that the user gives on the 
chosen pairing of image and attribute. In this case, we rely 
on the human “teacher” to know what additional informa¬ 
tion to give to the system “learner”. WhittleSearch also re¬ 
quires more time for the completion of one feedback state¬ 
ment compared to Active WhittleSearch, since it requires the 
user to examine a set of options and choose among them. 

In cases when the user does not wish to spend much time 
considering which image and attribute to comment on, we 
expect that Active WhittleSearch will be preferred. For ex¬ 
ample, the user might choose to comment on those com¬ 
parisons which are most obvious, which might not be very 
informative to the system. However, if the user is careful 
and experienced enough with the system to pick informative 
comparisons, WhittleSearch can perform better. For exam¬ 
ple, the user might see a unique attribute which is important 
for discriminating between relevant and irrelevant images, 
which the system has not asked about yet. This will be par¬ 
ticularly important if there is a large discrepancy between 
the human perception of an attribute and the system ranking 
for this attribute, in which case the entropy reduction esti¬ 
mates might be inaccurate. 


Another factor which affects how well the two versions 
of WhittleSearch perform is the number of feedback state¬ 
ments that the system has received so far. As we will show in 
our results (Section 4), the entropy-based selection criterion
is most crucial early on in the iterative cycle. Thus, we
expect the advantage of Active WhittleSearch over Whittle¬ 
Search to be stronger in the first few iterations. 

Finally, the level of specificity of the user’s target might 
affect WhittleSearch and Active WhittleSearch’s compara¬ 
tive performance as well. If the user is simply browsing, 
WhittleSearch might be preferable as it gives him more free¬ 
dom to explore the current results and refine or terminate the 
search, depending on the precise qualities of the desired tar¬ 
get. For example, a user shopping for a product with only a 
vague preconception of what is desired may be best suited 
by WhittleSearch. However, if the user has a very specific 
target in mind, Active WhittleSearch might be more helpful,
as the use of binary search trees helps narrow down the
search to the exact range of the attribute value distribution 
that matches the “signature” of the target image. The feasi¬ 
bility of browsing can be affected by the size of the search 
interface. For example, it might be harder to browse refer¬ 
ence images on a small mobile phone screen, which speaks 
in favor of eliminating user choice for the feedback state¬ 
ments, and trying to pinpoint the exact object that the user 
has in mind. 

Figures 6 and 7 show two qualitative comparisons of
the two WhittleSearch variants, which illustrate some of the 
tradeoffs discussed. The first figure shows user-chosen feed¬ 
back that does not point out the most distinctive features of 
the target image, while the second shows particularly valu¬ 
able user-chosen feedback. 


4 Experimental Results 


We first explain our experimental setup in Section 4.1. In
Section 4.2 we analyze how the proposed relative attribute
feedback can enhance image search compared to classic binary
feedback, and study which factors influence their behavior.
Then, in Section 4.3 we compare our active selection
method in the Active WhittleSearch variant to alternative
selection strategies to demonstrate its benefits. Finally,
in Section 4.4 we experimentally compare WhittleSearch
and Active WhittleSearch.


4.1 Experimental Design 


Datasets. We use three datasets in order to validate our ap¬ 
proach in diverse domains of interest: finding products, peo¬ 
ple, and scenes. The datasets are: 


- The Shoes dataset from the Attribute Discovery Dataset
(Berg et al. 2010), which contains 14,658 shoe images
belonging to 10 shoe categories collected from the website
like.com. We augment the data with 10 relative
attributes: pointy at the front, open, bright in color, covered
with ornaments, shiny, high at the heel, long on the
leg, formal, sporty, and feminine.

- The Public Figures dataset of human faces (Kumar et al.
2009) (Faces). We use the subset from (Parikh and Grauman
2011b), which contains 772 images from 8 people
and 11 attributes: masculine-looking, white, young,
smiling, chubby, visible forehead, bushy eyebrows, narrow
eyes, pointy nose, big lips, and round face.

- The Outdoor Scene Recognition dataset of natural scenes
(Oliva and Torralba 2001) (Scenes), which consists of
2,688 images from 8 categories and 6 attributes: natural,
open, perspective, large objects, diagonal plane, and close
depth (Parikh and Grauman 2011b).


Fig. 6 An example where Active WhittleSearch outperforms
WhittleSearch. Observe how the active selection focuses on the two most
distinctive features of this shoe, namely its color and ornaments, which
the human user fails to do. The user gives feedback that is obvious
yet not very discriminative; most shoes are less long-on-the-leg than a
boot, and many shoes in this dataset are higher at the heel than a
running shoe. See Section 4 for all implementation details leading to this
result.


Features. For image features x, we use GIST (Oliva and
Torralba 2001) and LAB color histograms for Shoes and
Faces, and GIST alone for Scenes. We omit color for Scenes
because we expect that the majority of scene attributes cannot
be captured with color features. The GIST descriptor
captures the overall texture of the image, summarizing gradient
orientations in a grid of spatially localized cells. The
color histogram summarizes the color distribution in the image,
offering complementary information to the GIST descriptor.
For Shoes, we concatenate a 960-dimensional GIST
feature vector (4 blocks and 8-8-4 orientations per scale)
and a 30-dimensional color feature vector (10 bins). For
Scenes, we use a 512-dimensional GIST vector. For Faces,
we concatenate a 512-dimensional GIST vector and a
30-dimensional color vector.

Fig. 7 An example where WhittleSearch outperforms Active
WhittleSearch. While Active WhittleSearch does a fair job, this particular user
of WhittleSearch gave very useful feedback, which allowed the system
to rank the target image nearly at the top of the results page. See
Section 4 for all implementation details leading to this result.

Methodology. For each query we select a random target im¬ 
age and score how well the search results match that tar¬ 
get after feedback. This target stands in for a user’s mental 
model; it allows us to prompt multiple subjects for feedback 
on a well-defined visual concept, and to precisely judge how 
accurate results are. This part of our methodology is key to 
ensure consistent data collection and formal evaluation. 

We use two evaluation metrics: (1) the ultimate percentile 
rank assigned to the user’s target image, which measures the
fraction of database images ranked below the true target, and
(2) the correlation between the full ranking computed by 
the method’s relevance scoring function and a ground truth 
ranking that reflects the perceived relevance of all images in 
V. For both metrics, higher scores are better. 
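
The first metric can be sketched as follows. The exact tie handling and denominator are not specified in the text; this sketch assumes ties do not count as "below" and divides by the full database size N.

```python
def percentile_rank(scores, target_index):
    """Fraction of database images whose relevance score is below the target's.

    scores[i] is the relevance score the method assigns to image i.
    """
    target = scores[target_index]
    below = sum(1 for j, s in enumerate(scores) if j != target_index and s < target)
    return below / len(scores)

# Hypothetical scores for four images; the target is image 3.
rank = percentile_rank([0.9, 0.2, 0.5, 0.7], target_index=3)
```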

The correlation metric captures not only where the target 
itself ranks, but also how similar to the target the other top- 
ranked images are. We form the ground truth relevance rank¬ 
ing by sorting all images in V by their distance to the given 
target. To ensure this distance reflects perceived relevance, 
we learn a metric based on human judgments. Specifically,
we show 750 triplets of images (i, j, k) from each dataset
to seven Mechanical Turk human subjects, and ask whether
images i and j are more similar, or images i and k. Using
their responses, we learn a linear combination of the image
and attribute feature spaces that respects these constraints
via (Joachims 2002). Our ground truth rankings thus mimic
human perception of image similarity. To score correlation,
we use Normalized Discounted Cumulative Gain at top K
(NDCG@K) (Kekalainen and Jarvelin 2002). This is a standard
information retrieval metric that scores how well the
predicted ranking and the ground truth ranking agree, while
emphasizing items ranked higher. We use K = 50, based
on the number of images visible on a page of image search
results.
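
For readers unfamiliar with NDCG@K, the following is a sketch of one common formulation, with a logarithmic position discount and graded gains. The paper does not state which gain values it derives from the ground truth ranking, so the `gain` dictionary here is purely hypothetical.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain of the first k gains, with a log2 position discount."""
    return sum(g / math.log2(pos + 2) for pos, g in enumerate(gains[:k]))

def ndcg_at_k(predicted_order, gain, k):
    """NDCG@K of a predicted ranking against per-image ground truth gains.

    predicted_order: image ids sorted by the method's relevance score.
    gain: dict mapping image id to a ground truth gain (higher is more relevant).
    """
    achieved = dcg_at_k([gain[i] for i in predicted_order], k)
    ideal = dcg_at_k(sorted(gain.values(), reverse=True), k)
    return achieved / ideal if ideal > 0 else 0.0

# Hypothetical gains for four images.
gain = {"a": 3.0, "b": 2.0, "c": 1.0, "d": 0.0}
perfect = ndcg_at_k(["a", "b", "c", "d"], gain, k=3)
reversed_score = ndcg_at_k(["d", "c", "b", "a"], gain, k=3)
```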


Baseline. The key baseline against which we compare WhittleSearch
is traditional binary relevance feedback. This baseline
is intended to represent existing approaches such as (Cox
et al. 2000; Ferecatu and Geman 2007; Rui et al. 1998; Tieu
and Viola 2000). While a variety of classifiers have been explored
in such previous systems, we employ a support vector
machine (SVM) classifier for the binary feedback model
due to its strong performance in practice. Thus, the relevance
scoring function for the binary feedback baseline is the
magnitude of the SVM output. (We defer the definition of the
additional baselines against which we test Active WhittleSearch
until Section 4.3.)


highest average confidence levels, assuming that a user will
select that response of which he is most confident.

Since the human annotations are costly, for certain stud¬ 
ies below we generate feedback automatically. For relative 
constraints, we randomly sample constraints based on the 
predicted relative attribute values, checking how the target 
image relates to the reference images. In other words, the 
simulated user randomly chooses an attribute and one of the 
n top-ranked images at that round, and compares his tar¬ 
get image to the chosen reference image along the given 
attribute dimension. For example, if the target’s predicted 
“shininess” is 0.5 and the reference image’s “shininess” is 
0.6, then a valid constraint is that the target is “less shiny” 
than that reference image. For binary feedback, we analo¬ 
gously sample positive/negative reference examples based 
on their image feature distance to the true target. In particu¬ 
lar, we sort the n currently top-ranked in terms of their Eu¬ 
clidean distance in raw feature space to the target image. We 
then generate constraints that say the top quartile of these 
images are “similar to” the target image, while the bottom 
quartile are “dissimilar from” the target. 

When scoring rank, we add Gaussian noise to the pre¬ 
dicted attributes (for our method) and the SVM outputs (for 
the baseline), to coarsely mimic human uncertainty in con¬ 
straint generation. The automatically generated feedback is 
a good proxy for human feedback since the relative predic¬ 
tions are explicitly trained to represent human judgments. It 
allows us to test performance on a larger scale. 

First we evaluate the core WhittleSearch system with 
user-initiated feedback. These results aim to establish the 
value of relative attribute feedback compared to traditional 
binary relevance feedback. Since there is no active selection 
and we do not need to estimate entropy reduction in these 
results, we simplify the probabilistic relevance function in 
Eqn. [^to use binary values for the probabilities P{St,i = 
such that the relevance function simply counts the 
number of constraints satisfled by a database image li . Specif¬ 
ically, this corresponds to deflning: 


4.2 WhittleSearch Results 

We use Mechanical Turk to gather human feedback for our 
relative attribute method and the binary feedback baseline. 
We pair each target image with 16 reference images. For our 
method we ask, “Is the target image more or less (attribute 
name) than the reference image?” (for each (attribute name)), 
while for the baseline we ask, “Is the target image similar to 
or dissimilar from the reference image?” We also request 
a confldence level for each answer, as shown above in Fig¬ 
ure We get each pair labeled by up to flve workers and use 
majority voting to reduce noise. When sampling from these 
constraints to impose feedback, we take those that have the 


( 13 ) 

where the brackets denote Iverson bracket notation. 
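The simulated feedback and the constraint-counting relevance described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the tuple layout of a constraint, and the choice to treat ties as “less” are all assumptions made for the example.

```python
import random

def sample_constraint(target_attrs, reference_pool, rng):
    """Simulate one feedback statement: pick a reference image and an
    attribute at random, then compare the target to that reference."""
    ref_id = rng.choice(sorted(reference_pool))
    attr = rng.choice(sorted(target_attrs))
    ref_val = reference_pool[ref_id][attr]
    # Assumption: ties are reported as "less" in this sketch.
    direction = "more" if target_attrs[attr] > ref_val else "less"
    return attr, direction, ref_val

def count_satisfied(image_attrs, constraints):
    """Relevance score: number of constraints the image satisfies.
    Each comparison acts as an Iverson bracket (1 if true, else 0)."""
    score = 0
    for attr, direction, ref_val in constraints:
        if direction == "more":
            score += int(image_attrs[attr] > ref_val)
        else:
            score += int(image_attrs[attr] < ref_val)
    return score

# The example from the text: target shininess 0.5, reference shininess 0.6.
target = {"shiny": 0.5}
pool = {"ref1": {"shiny": 0.6}}
constraint = sample_constraint(target, pool, random.Random(0))

# Score two hypothetical database images against that single constraint.
database = {"img_a": {"shiny": 0.4}, "img_b": {"shiny": 0.9}}
scores = {name: count_satisfied(attrs, [constraint])
          for name, attrs in database.items()}
```

With a single reference and a single attribute the sampled constraint is “less shiny than 0.6”, so the less shiny database image satisfies it and the shinier one does not.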

Impact of iterative feedback. First we examine how the rank 
of the target image improves as the methods iterate. Both 
methods start with the same random set of 16 reference im¬ 
ages, and then iteratively obtain eight automatically gener¬ 
ated feedback constraints, each time re-scoring the data to 
revise the top reference images. To ensure new feedback accumulates
per iteration, we do not allow either method to
reuse a reference image.

Fig. 8 Iterative search with WhittleSearch vs. traditional binary relevance feedback on three datasets. We show accuracy (percentile rank of the
target image) as a function of the number of iterations of feedback. Our method often converges on the target image more rapidly.

Fig. 9 Ranking accuracy as a function of amount of feedback (panels: Shoes, Scenes, Faces). While more feedback enhances both our method and the traditional binary relevance
feedback approach, the proposed attribute feedback yields faster gains per unit of feedback.
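For concreteness, the percentile rank used as the accuracy measure in the iterative experiments can be computed along the following lines. The exact formula is not given in this section, so the definition below (fraction of the other database images scored strictly lower than the target) and the function name are assumptions.

```python
def percentile_rank(scores, target_id):
    """Fraction of the other database images scored strictly below the target.

    scores: dict mapping image id -> relevance score (higher is better).
    Ties count against the target in this sketch.
    """
    target_score = scores[target_id]
    others = [s for image_id, s in scores.items() if image_id != target_id]
    if not others:
        return 1.0
    return sum(s < target_score for s in others) / len(others)

# Hypothetical relevance scores after a round of feedback.
example_scores = {"target": 3, "a": 1, "b": 5, "c": 2}
rank = percentile_rank(example_scores, "target")
```

Here two of the three other images score below the target, so its percentile rank is 2/3; a value of 1.0 would mean the target is ranked first.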


Figure 8 shows the results for 50 such queries. Our
method outperforms the binary feedback baseline for all 
datasets, more rapidly converging on a top rank for the tar¬ 
get image. On Faces our advantage is slight, however. We 
suspect this is due to the strong category-based nature of 
the Faces data, which makes it more amenable to binary 
feedback; adding positive labels on exemplars of the same 
person as the target image is quite effective. In contrast, on 
Scenes and Shoes, where images have more fluid category 
boundaries, our advantage is much stronger. The searches 
tend to stabilize after 2-10 rounds of feedback. The run¬ 
times for our method and the baseline are similar. 

Impact of amount of feedback. Next we analyze the impact 
of the amount of feedback, using automatically generated 
constraints. Figure 9 shows the rank correlation results for
100 queries. These curves show the quality of all top-ranked 
results as a function of the amount of feedback given in a 
single iteration. Recall that a round of feedback consists of 
a relative attribute constraint or a binary label on one image, 
for our method or the baseline, respectively. For all datasets, 
both methods clearly improve with more feedback. How¬ 
ever, the precision enabled by our attribute feedback yields 
a greater “bang for the buck”—higher accuracy for fewer 
feedback constraints. The result is intuitive, since with our 
method users can better express what about the reference 
image is (ir)relevant to them, whereas with binary feedback 
they cannot. 

A multi-attribute query baseline that ranks images by
how many binary attributes they share with the target image
achieves NDCG scores 40% weaker on average than
our method when using 40 feedback constraints. This result
supports our claim that binary attribute search lacks the
expressiveness of iterative relative attribute feedback.

Dataset-Method      Near   Far    Near+Far   Mid
Shoes-Attributes    .39    .29    .40        .38
Shoes-Binary        .12    .05    .27        .06
Faces-Attributes    .60    .41    .58        .52
Faces-Binary        .39    .21    .64        .15
Scenes-Attributes   .53    .27    .52        .40
Scenes-Binary       .18    .18    .32        .11

Table 1 Ranking accuracy (NDCG@50 scores) as we vary the type of
reference images available for feedback. Bold values indicate the best
performance in a row.

Impact of reference images. The results thus far assume that 
the initial reference images are randomly selected, which 
is appropriate when the search cannot be initialized with 
keyword search. We are interested in understanding the im¬ 
pact of the types of reference images available for feedback. 
Thus, we next control the pool of reference images to con¬ 
sist of one of four types: “near”, meaning images close to 
the target image, “far”, meaning images far from the target, 
“near+far”, meaning a 50-50 mix of both, and “mid”, meaning
neither near nor far from the target. Nearness is judged
in the GIST/color feature space. 
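One way to build the four kinds of reference pools is sketched below. The text does not say how the pools were drawn, so the cut-offs (closest, farthest, and middle slices of a distance-sorted list) and the name `reference_pools` are assumptions; plain Euclidean distance stands in for distance in the GIST/color feature space.

```python
import math

def reference_pools(target_vec, database, pool_size):
    """Split database images into near / far / near+far / mid pools by
    Euclidean distance to the target (pool_size is assumed to be >= 2)."""
    ranked = sorted(database, key=lambda i: math.dist(target_vec, database[i]))
    half = pool_size // 2
    mid_start = (len(ranked) - pool_size) // 2
    return {
        "near": ranked[:pool_size],
        "far": ranked[-pool_size:],
        # 50-50 mix: half from the closest images, the rest from the farthest.
        "near+far": ranked[:half] + ranked[-(pool_size - half):],
        "mid": ranked[mid_start:mid_start + pool_size],
    }

# Toy 1-D "feature space": images 1..8 at increasing distance from the target.
toy_db = {i: (float(i),) for i in range(1, 9)}
pools = reference_pools((0.0,), toy_db, pool_size=2)
```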

Table 1 shows the resulting accuracies, for all types and
all datasets using 100 queries and automatic feedback. Both
methods generally do well with “near+far” reference images,
which makes sense. For attributes, we expect useful
feedback to entail statements about images that are similar
to the target overall, but lack some attribute. Meanwhile, for
binary feedback, we expect useful feedback to contain a mix
of good positives and negatives to train the classifier. We
further see that attribute feedback also does fairly well with
only “near” reference images; intuitively, it may be difficult
to meaningfully constrain precise attribute differences on an
image much too dissimilar from the target.

Fig. 10 Ranking accuracy with human-generated feedback with randomly chosen (first three plots: Shoes, Scenes, Faces) and keyword-initialized reference images
(fourth plot: Shoes, keyword search).

Ranking accuracy with human-given feedback. Having analyzed
in detail the key performance aspects with automatically
generated feedback, now we report results using human-generated
feedback. Figure 11 shows the type of interface
we used for these experiments. At the top, we show users
images from the bottom and top of our attribute rankers,
in order to guide their answers and ameliorate the effect
of the discrepancy between machine and user understanding
of an attribute. Figure 10 (first three plots) shows the ranking
correlation for both methods on 16 queries per dataset
after one round of 8 feedback statements. Attribute feedback
largely outperforms binary feedback, and performs similarly
to it on Scenes. One possible reason for Scenes being
less amenable to attribute feedback is that humans seem
to have more confusion interpreting the attribute meanings
(e.g., amount of perspective on a scene is less intuitive than
shininess on shoes).

Next, we consider initialization with keyword search. 
The Shoes dataset provides a good testbed, since an online 
shopper is likely to kick off his search with descriptive keywords.
Figure 10 (fourth plot) shows the ranking accuracy
results for 16 queries when we restrict the reference images 
to those matching a keyword query composed of three at¬ 
tribute terms. Both methods get four feedback statements 
(we expect less total feedback to be sufficient for this setting, 
since the keywords already narrow the reference images to 
good exemplars). Our method maintains its clear advantage 
over the binary baseline. This result shows (1) there is in¬ 
deed room for refinement even after keyword search, and 
(2) the precision of attribute statements is beneficial. 

Figure 12(a) shows a real example search using relative
feedback in WhittleSearch. Note how the user’s mental concept
is quickly met by the returned images. Furthermore,
the user can comment very specifically on the heel height,
by referring to both a very high-heeled shoe (in Round 1)
and a shorter-heeled shoe (in Round 2). This example highlights
the value of relative feedback: the user can precisely
bound the range of acceptable strengths for each particular
attribute. In some cases, however, binary relevance feedback
might be sufficient. In Figure 13, our method retrieves
images that match the user’s descriptions. But if
the goal is to retrieve images of the person in the query, our
method fails, while the binary relevance feedback method
succeeds (not shown). To combine the strengths of both approaches,
we proposed a hybrid feedback approach in Section 3.
Figure 12(b) shows a real example using this hybrid
of both binary and attribute feedback. This suggests how a
user can specify a mix of both forms of input, which are
often complementary.

Fig. 13 A failure of our method. While the images our method retrieves
do match the descriptions given by the user, in this case we fail
to retrieve an image of the correct person. This failure may be due to
the insufficiently rich description that the user provided.

In Figure 12(c, d), we present two real examples of
search results for human-generated feedback with WhittleSearch,
to compare our method qualitatively alongside the
traditional binary relevance feedback approach. Each example
shows one search iteration, where the 20 reference images
are randomly selected (rather than ones that match a
keyword search, as in the examples above) and annotated with
constraints by users on MTurk. For each result, the upper figure
shows our method and the lower figure shows the binary
feedback result for the corresponding target image. This figure
shows the clear advantage of our relative attribute feedback
approach over traditional binary feedback. The user
can retrieve more accurate results if he is allowed to compare
the retrieved results to his target image for some particular
visual property.

Fig. 11 The interface we use for the live user experiments for WhittleSearch. The top rows illustrate the attribute meaning visually with examples.
Then the user is shown a series of eight reference images, and asked to compare a target image to a reference image of his choosing according to
an attribute of his choosing using the drop-down boxes. Finally, he must state his confidence in the response.

Dataset   Class     Instance
Shoes     26.10%    22.89%
Scenes    38.92%    33.41%
Faces     28.38%    30.16%

Table 2 Errors for class-level vs. image-level training.


Consistency of relative supervision types. Next we examine 
the impact of how human judgments about relative attributes 
are collected to train the relative attribute models. 

For all results above, we train the relative attribute rankers 
using image-level judgments. How well could we do if sim¬ 
ply training with class-based supervision, i.e., “coasts are 
more open than forests”? To find out, we use the relative ordering
of classes given in Parikh and Grauman (2011b) for
Faces and Scenes, and define them ourselves for Shoes (see 
Appendix). We train ranking functions for each attribute us¬ 
ing both modes of supervision. 


Table 2 shows the percentage of ~200 test image pair
orderings that are violated by either approach. Intuitively,
instance-level supervision outperforms class-level supervision
for Shoes and Scenes, where categories are more fluid.

In additional experiments with 20 MTurk annotators, we 
find that the MTurkers’ inter-subject disagreement on 
instance-level responses was only 6%, versus 13% on category- 
level responses. Both results support the proposed design for 
relative attribute training. 
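The error measure reported for the two supervision modes, the percentage of test pair orderings that a ranker violates, can be sketched as below. The function name and the toy scores are hypothetical, and counting ties as violations is an assumption of this sketch.

```python
def violation_rate(ranker_scores, test_pairs):
    """Percentage of ground-truth orderings that a ranker gets wrong.

    ranker_scores: dict image id -> predicted attribute strength.
    test_pairs: list of (stronger_id, weaker_id) taken from human judgments.
    """
    violated = sum(ranker_scores[stronger] <= ranker_scores[weaker]
                   for stronger, weaker in test_pairs)
    return 100.0 * violated / len(test_pairs)

# Hypothetical predicted strengths and four human-ordered test pairs.
toy_scores = {"a": 0.9, "b": 0.4, "c": 0.7, "d": 0.1}
toy_pairs = [("a", "b"), ("c", "a"), ("b", "d"), ("c", "d")]
rate = violation_rate(toy_scores, toy_pairs)
```

In this toy case only the pair ("c", "a") is ordered incorrectly, so one of four pairs is violated.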

In Figure 14, we show some examples where the instance-level
ordering of two images with respect to some attribute
differs from the ordering defined at the class-level. We show 
annotations where users had high confidence of these labels, 
and there was high inter-user agreement. 


4.3 Active WhittleSearch Results 

We next test how well the active variant of our method guides 
the search process using attribute pivots, by comparing it 
to several alternative methods for interactive search. Unless 
otherwise noted, we report results over 200 randomly chosen
target images.

Fig. 12 (a) Example iterative search result with attribute feedback. (b) Example search result with hybrid feedback. (c, d) Example results for
WhittleSearch (top) vs. binary relevance feedback (bottom) on Shoes (c) and Scenes (d). For the Shoes example, while both methods retrieve
high-heeled shoes, only our method retrieves images that are precisely as open as the target image. This is because using the proposed approach,
the user was able to comment explicitly on the desired openness property. For the Scenes example, we show an interesting example of a target
image that is hard to describe in words and likely has few very similar images in the database. However, through our relative attribute constraints,
we are able to retrieve better matches than the binary feedback baseline produces. A main issue for the baseline in this case is the lack of similar
images among the reference images that the user can use to define positives.

Fig. 14 Examples of dramatic disagreement between class-level and
image-level annotations. For example, pumps are normally high at the
heel and clogs are flatter, but the pump in the third row is lower at the
heel than this particular clog. Inside-city images usually show a whole
scene photographed down the street, but the inside-city scene in the
top-right is the side of a building, and thus less in-perspective than the
coast image. Jared Leto is usually not smiling, but in this particular
picture (bottom-right) he is more smiling than Alex Rodriguez.


Baselines. We compare our Active WhittleSearch method, 
denoted Active attribute pivots, against the following 
six baselines: 


- Attribute pivots is a simplified version of our method
that uses the attribute trees to select candidate images,
but chooses randomly among the attributes in a round-robin
fashion.

- Active attribute exhaustive uses entropy to select
questions like our method, but it evaluates all possible
M × N candidate questions, where M is the number
of attributes and N is the number of database images.

- Top selects the image that has the current highest probability
of relevance and pairs it with a random attribute.
This method represents traditional interactive methods
that assume an “impatient” user for whom feedback exemplars
and search results must be one and the same.
It is like the non-active version of WhittleSearch, except
that it presents only one reference image and allows only
one statement to be given at each time. Unlike WhittleSearch,
the user of the system cannot introduce variety
in the feedback statements that are given, as he cannot
exercise choice.

- Passive simply selects a random image paired with a
random attribute for its question.

- Active binary feedback does not use statements
about the relative attribute strength of images, but rather
asks the user whether the exemplar is similar to the target.
This method uses a binary SVM to rank images, and
treats similar images as positives and dissimilar images
as negatives. It actively chooses the image whose decision
value is closest to 0, as in (Tong and Chang 2001).

- Passive binary feedback works as above, but randomly
selects the images for feedback.

Note that the relative feedback methods all use the same 
relevance prediction function and only differ in the feedback 
they gather. The tree-based methods stop asking questions 
about attribute m once its leaf is reached or the user has 
given an “equally” response for m. All methods keep an im¬ 
age in consideration for feedback until all possible questions 
have been asked about it. 
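To illustrate the stopping rule for the tree-based methods, the sketch below walks a single attribute's binary search tree, always asking about the median (pivot) of the images still in range, and stops at a leaf or on an “equally” response. It only records which pivots are asked about; it does not prune the database, since the method updates relevance probabilistically. The function name, the callback signature, and the toy values are assumptions.

```python
def pivot_search(sorted_values, respond):
    """Return the pivot values asked about for one attribute.

    sorted_values: predicted strengths of one attribute, ascending.
    respond(pivot_value) -> "more", "less", or "equally" (target vs. pivot).
    """
    lo, hi = 0, len(sorted_values) - 1
    asked = []
    while lo <= hi:                  # an empty range means the leaf was reached
        mid = (lo + hi) // 2
        pivot = sorted_values[mid]
        asked.append(pivot)
        answer = respond(pivot)
        if answer == "equally":
            break                    # stop asking about this attribute
        if answer == "more":
            lo = mid + 1             # target lies in the upper half
        else:
            hi = mid - 1             # target lies in the lower half
    return asked

values = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
target_strength = 0.6

def respond(pivot_value):
    """Simulated user who knows the target's attribute strength exactly."""
    if pivot_value == target_strength:
        return "equally"
    return "more" if target_strength > pivot_value else "less"

asked = pivot_search(values, respond)
```

Each answer halves the remaining range, so at most about log2(N) questions are asked per attribute.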

To thoroughly test the methods, we conduct both live 
experiments with real users as well as experiments where 
we simulate the user responses. We generate the response 
for a question, “Is the target image more, equally, or less 
m than I_{p_m}?” using the difference in the predicted attribute
values for the target I_t and the pivot I_{p_m}. For a response of
“equally”, we use a threshold derived from the training pairs 
of images labeled as similar with respect to m. Note that this 
protocol is in line with standard validation for active learn¬ 
ing, where the algorithm receives the labels for those exam¬ 
ples it queries, even if a human is not answering “live” in the 
loop. The predicted attribute values are an extrapolation of 
the ground-truth labels we have obtained from users. We ini¬ 
tialize all attribute search methods with the same feedback 
constraint. 
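The simulated three-way response described above can be sketched as a thresholded difference of predicted attribute values. The function name and the threshold value used in the example are hypothetical; in the experiments the threshold is derived from training pairs labeled as similar with respect to the attribute.

```python
def simulated_response(target_value, pivot_value, equal_threshold):
    """Answer "Is the target more, equally, or less m than the pivot?"
    from the predicted attribute values of the two images."""
    diff = target_value - pivot_value
    if abs(diff) <= equal_threshold:
        return "equally"
    return "more" if diff > 0 else "less"

# Hypothetical predicted strengths and threshold for one attribute.
answers = [
    simulated_response(0.62, 0.60, equal_threshold=0.05),
    simulated_response(0.90, 0.60, equal_threshold=0.05),
    simulated_response(0.20, 0.60, equal_threshold=0.05),
]
```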

For binary relevance feedback, we respond with “simi¬ 
lar” if the target and exemplar images are within one stan¬ 
dard deviation of the learned distances used for the ground 
truth ranking. We initialize the baseline with one positive 
and one negative image by peeking at the distances between 
the target image and a pool of 40 images, and selecting the 
closest image as a positive and the furthest as a negative. 
This simulates a user starting the search with feedback on 
a page of random images. If anything, it is generous to the 
baseline, since our method gets only one “bit” of feedback 
at the onset, while the binary feedback baselines get two. 

We again add Gaussian noise to both the relative at¬ 
tribute feedback and binary feedback methods in order to ac¬ 
count for the discrepancy between perceived and predicted 
attributes and appearance. 

Comparison of likelihood models. Figure 15 compares the
three proposed methods of predicting the user response. Most
Relevant consistently performs well on all datasets, and
outperforms the other two methods on all but Scenes.
This suggests that our best guess at the target tends to be a
sufficient proxy, having a fairly similar attribute signature.
All Relevant performs similarly but is slightly weaker,
indicating that isolating the most relevant instance gives a
“cleaner” likelihood than attempting to refine it with our uncertainty
about each relevant instance. Similar Question
performs the best for a fraction of the iterations on Scenes,
but does poorly on Faces. This is likely because we cannot
estimate attribute similarity reliably due to the distinct face
attributes (e.g., face chubbiness has no strongly correlated
attributes, whereas scene openness does). In all remaining
results, we use the Most Relevant method.

Fig. 15 Comparison of the proposed models for the likelihood of a
user’s response. Best viewed in color.

Fig. 16 Comparison of Active WhittleSearch to alternative interactive
search methods on the three datasets. For both metrics, higher curves
are better. Best viewed in color.


Comparison to existing methods. Figure 16 compares our
method to the six baselines on all three datasets. Overall,
our method finds the target image most efficiently. We see
that our full active approach outperforms the round-robin
variant of our method (Attribute pivots), with an average
percentile rank 7.6% better after only 3 iterations. This
shows that actively interleaving the trees allows us to focus on
attributes that better distinguish the relevant images.

Our method is also more effective than Active attribute
exhaustive.³ This shows that the binary tree structures
serve as a form of regularization, helping our method focus
on those questions that a priori may be most informative to
ask. Intuitively, if a user has ruled out a subtree (“The target
image is bluer than the reference image with blueness X.”),
it is likely redundant (low information gain) to ask how the
target compares to more data on that path (“Is the target image
bluer than this other reference image with blueness X
- Y?”), i.e., to ask the user to comment on something even
less blue than the previous exemplar. The exhaustive method
might be more prone to selecting outliers which are not actually
informative, due to potential noise in the active selection
which arises out of the need to estimate the likelihood of
different user responses. In contrast, our method picks pivots
which, even if there are small errors in the entropy estimation,
will be informative as they split the search space in
half. Furthermore, our method is orders of magnitude faster
(see Table 3).

The results in Figure 16 also show the striking advantage
of relative attribute feedback compared to binary relevance
feedback, as we also demonstrated in the previous section.
Binary feedback has an advantage in the first few iterations,
likely because we generously initialize it with two feedback
statements. However, the relative attribute methods quickly
surpass binary feedback. We find that both feedback modes
require similar user time: 6.4 s for relative, and 5.5 s for binary,
and so the trends remain if we plot rank as a function
of user time. Interestingly, we find that Passive binary
feedback is actually stronger than its active counterpart
for this data. This is likely because images near the decision
boundary are often similar (and negative), whereas the
passive approach samples more diverse instances (and hence
gets more positives).

Finally, we outperform Top, showing that relative attribute
feedback alone need not offer the most efficient search.
Rather, it is important to give comparative constraints on
well-chosen images.

Method/Dataset                   Shoes     Scenes   Faces
Active attribute pivots (Ours)   0.05      0.01     0.01
Active attribute exhaustive      656.27    28.20    3.42

Table 3 Selection time for one iteration of our method vs. the exhaustive
active baseline, in seconds.

³ The exhaustive baseline was too expensive to run on all 14K Shoes.
On a 1000-image subset, it performs similarly as on the other datasets.

Fig. 17 The interface we use for the live user experiments for Active WhittleSearch. The top row illustrates the meaning of the system-selected
attribute for this round of feedback. Then the user is asked to compare the displayed target image to the selected reference image according to that
attribute, by selecting “more”, “less”, or “equally”. Finally, he must state his confidence in the response.


In practical terms, we are interested in how many itera¬ 
tions it takes to get the target in the top 40 most relevant im¬ 
ages, since that is how many images fit on a typical search 
page (e.g., on Google). On average our method uses 12, 10, 
and 4 iterations to place the target in the top 40 for Shoes, 
Scenes, and Faces, vs. 21, 21, and 9 iterations for Top. Thus, 
our method saves a user up to 70 seconds per query. 
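A quick arithmetic check of the “up to 70 seconds” figure is sketched below. It assumes, as a reading of the text rather than something stated explicitly, that the saving is the number of iterations saved multiplied by the 6.4 s average user time reported for a relative feedback statement.

```python
SECONDS_PER_RELATIVE_STATEMENT = 6.4  # average user time reported for relative feedback

# Iterations needed to place the target in the top 40: (our method, Top).
iterations = {
    "Shoes": (12, 21),
    "Scenes": (10, 21),
    "Faces": (4, 9),
}

# Seconds saved per query = iterations saved * seconds per statement.
saved = {name: (top - ours) * SECONDS_PER_RELATIVE_STATEMENT
         for name, (ours, top) in iterations.items()}
largest_saving = max(saved, key=saved.get)
```

Under this assumption the largest saving is on Scenes: 11 iterations at 6.4 s each is roughly 70 seconds, with smaller savings on Shoes and Faces.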

We also tested a method that does a hard pruning of im¬ 
ages on the irrelevant branches of an attribute tree, as dic¬ 
tated by user feedback. It incorrectly eliminates the true tar¬ 
get for about 93% of the queries, clearly supporting the pro¬ 
posed probabilistic formulation. 


Results with live users. Next, we test our method “live” in
real time with Mechanical Turk workers, using an interface
similar to the one shown in Figure 17. We compare its performance
against the two strongest baselines, Attribute
pivots and Top. The workers answer a series of five questions
that each of the three methods pose about the same
target image. We issue 50 queries for Shoes-1k (a random
1000-image subset of Shoes), Scenes, and Faces-Unique (a
set of one image for each of 200 individuals from the original
PubFig dataset (Kumar et al 2009), using the six most
reliably predictable attributes). We eliminate any queries where
one or more methods did not receive five complete feedback
iterations. All methods share one simulated feedback statement
at iteration 0, which we do not plot. We stop updating
the probabilities of relevance for a method once this method
places the target image in the top 40 images.


In order to get richer feedback from users, we allow
users to express their confidence in their responses, and give
twice the weight to constraints for which the user says “a lot
more (less)” when computing the relevance probabilities.

Fig. 18 Our Active WhittleSearch method makes quick and reliable
choices, allowing the MTurk users to more efficiently find the target.
(Panels: Shoes-1k, Scenes, Faces-Unique; horizontal axis: iteration;
curves: Active attribute pivots, Attribute pivots, Top.)

Note that this live experiment is only possible because 
our method can make decisions in real time, unlike the ex¬ 
haustive active learning method. 

Figure 18 shows the results. Consistent with our simulated
user results above, we see that typically our method
ranks the target image better than the baselines do. We find 
this a very encouraging result, given the noise inherent in 
MTurk responses (in spite of our best efforts at qualification 
tests) and the difficulty of predicting all attributes reliably. 
Our informativeness predictions on Faces-Unique are im¬ 
precise since the facial attributes are difficult for both the 
system and humans to compare reliably (e.g., it is hard to 
say who among two white people is whiter). This difficulty 
seems to hurt all methods, judging by their flatter curves. 
Since the rank metric does not give any credit for finding an 
image very close to the target, we also asked a separate set 
of workers to judge whether any of the top 10 ranked images 
were “very similar” to the target. For Shoes-1k, our method
takes only 1.9 iterations on average to find one that is very
similar, whereas Attribute pivots requires 2.4 and Top
requires 3.15.

Fig. 19 Example live user search results comparing our method (top) to the Top baseline (bottom) on Shoes-1k (a) and Scenes (b). Using the
user’s feedback on the left, we retrieve the images on the right at the top of the results list. See text for details.


Figure 19 presents examples of real live searches done
by workers on Mechanical Turk with the Active WhittleSearch
system. We show how our method and Top rank images
(shown on the right-hand side) based on the supplied
user feedback (shown on the left-hand side). In each figure,
our method is shown on top, followed by Top underneath.
For simplicity, we show “a lot more/less” responses as simply
“more/less”. In Figure 19(a), we see how our method
quickly converges on shoes that look like the target (bright
high-heeled pointy shoes). Our method asks questions that
are crucial in describing the shoe precisely (it is a high-heeled
but not a formal shoe, and it is more open than other
high-heeled shoes). In contrast, Top gets stuck asking questions
about the same shoe, and moreover, asking questions
whose answers might be redundant (i.e., about sportiness
and its near-opposite, formality). In Figure 19(b), our method
asks about properties that are important for distinguishing
the target image from other images, namely open-air. Only
our method is able to provide acceptable top results.


4.4 Comparing WhittleSearch and Active WhittleSearch 

So far, we have demonstrated the advantages of relative at¬ 
tribute feedback, as well as the benefit of actively selecting 
the images shown for such relative attribute feedback. We 
have also discussed the conceptual advantages of the user-guided
version of WhittleSearch and its system-guided active
selection version, in Section 3.

Next, we compare the two versions of our method exper¬ 
imentally, using the Shoes dataset. We conduct experiments 
where users provide one feedback statement at each of five 
iterations, whether that is chosen by the user from among 
those that are ranked highest at the previous iteration (for 
WhittleSearch), or actively chosen by the system (for Ac¬ 
tive WhittleSearch). Each of 20 queries is submitted to five 
workers, and each worker completes the task for the same 
query for both methods. We time the user responses at each 
iteration. We manually remove outliers in terms of time, and
queries for which the users provided obviously incorrect responses,
for both methods.

Fig. 20 Comparison of WhittleSearch (WS) and Active WhittleSearch (AWS). (a) System entropy for WhittleSearch (WS) and Active WhittleSearch
(AWS) (lower is better). (b) Percentile rank of target vs. time required for feedback (higher rank and lower time are better). (c) Total time,
with rank converted to time (see text). (d) Confidence of human responses for WhittleSearch (WS) and Active WhittleSearch (AWS).


In Figure 20(a), we demonstrate that Active WhittleSearch
does indeed reduce the overall entropy of the system
better than WhittleSearch, which is the objective that
Active WhittleSearch uses when selecting comparisons for 
feedback. We plot how entropy decreases as the system re¬ 
ceives more feedback over five iterations. The entropy esti¬ 
mates in the first few iterations are inaccurate due to the sys¬ 
tem having received too little feedback to estimate relevance 
accurately. This likely explains why Active WhittleSearch is 
initially weaker at reducing entropy, but after two iterations, 
it starts to reduce entropy faster than WhittleSearch, thus 
achieving its main objective. 
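
As a rough illustration of the quantity plotted in Figure 20(a), the
sketch below computes a system entropy under the assumption that the
system keeps, for each database image, a probability that the image is
relevant, and that the overall entropy is the sum of the per-image
binary entropies. The function names and the example probabilities are
hypothetical and serve only to show why sharper relevance estimates
yield lower entropy.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli variable with success probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a certain outcome carries no uncertainty
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def system_entropy(relevance_probs) -> float:
    """Sum of per-image binary entropies (an assumed definition)."""
    return sum(binary_entropy(p) for p in relevance_probs)


# Hypothetical relevance probabilities for a four-image database.
before_feedback = [0.5, 0.5, 0.5, 0.5]   # maximally uncertain
after_feedback = [0.9, 0.1, 0.2, 0.05]   # estimates sharpened by feedback

print(system_entropy(before_feedback))   # 4.0
print(system_entropy(after_feedback) < system_entropy(before_feedback))  # True
```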

Next, we examine how entropy reduction affects the
actual user experience, as measured by the success of search
results as a function of the amount of feedback effort. In
Figure 20(b), we plot the median final percentile rank of the
target image per query, and the median total time it took to
provide all feedback statements for that method. The time
for feedback captures the time that users spend to examine
the reference images and attribute vocabulary and consider
the possible combinations thereof they can use for a
feedback statement, as well as the time they spend actually
submitting the selected feedback. If no options are given and
the system simply presents the human user with a single
question, then the time for feedback simply involves
deciding on the answer to that question (i.e., “more”, “less”, or
“equally”). Since WhittleSearch gives the user more
freedom and the user needs to examine options and select among
them, that version requires more time for feedback than the
active version, which could potentially be a disadvantage to
an impatient user. That said, WhittleSearch often achieves
high accuracy rates as a payoff for the user time invested.
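
The two summary statistics in Figure 20(b) can be sketched as follows.
The definition of percentile rank used here (the fraction of the other
database images that are ranked worse than the target) is an assumption
made for illustration, and the ranks and database size are hypothetical
values, not measurements from our experiments.

```python
from statistics import median


def percentile_rank(rank: int, database_size: int) -> float:
    """Fraction of the other images ranked worse than the target.

    rank is 1-based, so 1.0 means the target is the top result and
    0.0 means it is ranked last (assumed definition).
    """
    if not 1 <= rank <= database_size:
        raise ValueError("rank must lie in [1, database_size]")
    if database_size == 1:
        return 1.0
    return (database_size - rank) / (database_size - 1)


# Hypothetical final ranks of the target image across five workers.
final_ranks = [1, 26, 51, 76, 101]
database_size = 101

percentiles = [percentile_rank(r, database_size) for r in final_ranks]
print(median(percentiles))  # 0.5
```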

To better depict how the user effort and quality of results
are tied together, we next devise a unified metric for
evaluation; see Figure 20(c). This metric measures both how
long it takes to provide a specific form of feedback, and how
effectively this feedback enables the system to retrieve
results, captured by the rank of the target image. In particular,
we sum the time for providing the feedback and the time
required to examine the results. The latter term corresponds to
the rank of the target image converted to time, using a
varying number of seconds that are required to examine a page
of 40 images. In other words, if the target image is shown at
rank 70, it will be on page two of the search results, and if it
takes 4 seconds to examine a page, the total time to examine
the results will be 8 seconds. We plot results as a function
of the time to examine a page because this time varies:
examining a page can be quick (if the target image has very
prominent and easy-to-spot distinctive features, or if all of
the results are obviously very different from the target
image) or slow (if some of the results are similar to the
target and the user needs to look more carefully to determine
if there is an actual match). We find that perusing a page
of 40 image results takes 5.7 seconds on average, hence the
choice of range we use on the x-axis of Figure 20(c).
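
The unified metric can be written down in a few lines. This is a minimal
sketch that assumes the user examines every page in full, up to and
including the page that contains the target; the function names and the
30-second feedback time are hypothetical, while the page size of 40 and
the rank-70 example come from the text above.

```python
import math

IMAGES_PER_PAGE = 40  # page size stated in the text


def pages_to_examine(target_rank: int, images_per_page: int = IMAGES_PER_PAGE) -> int:
    """Number of result pages viewed before the target (1-based rank) is seen."""
    return math.ceil(target_rank / images_per_page)


def total_time(feedback_seconds: float, target_rank: int, seconds_per_page: float) -> float:
    """Time spent giving feedback plus time spent examining result pages."""
    return feedback_seconds + pages_to_examine(target_rank) * seconds_per_page


# Worked example from the text: rank 70 is on page two, so at 4 seconds
# per page the results take 8 seconds to examine.
print(pages_to_examine(70))        # 2
print(pages_to_examine(70) * 4.0)  # 8.0

# With a hypothetical 30 seconds of feedback time, the unified metric is:
print(total_time(30.0, 70, 4.0))   # 38.0
```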

In Figure 20(b), we see that Active WhittleSearch is
cheaper in terms of user time, but achieves slightly worse
ranks for the target image. Because WhittleSearch achieves
better ranks than Active WhittleSearch on average but is
slower to use, the user-guided version outperforms the
system-guided one when the cost of examining a page of results
starts to dominate the cost of providing feedback, as seen
in Figure 20(c). This result illustrates how different versions
of the WhittleSearch system might be preferable in different
contexts and for different tasks.
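
To make the trade-off concrete, the sketch below computes the
page-examination cost at which a slower method with better ranks starts
to win over a faster method with worse ranks. It assumes total time is
feedback time plus whole pages of 40 images multiplied by a per-page
cost. All numbers are hypothetical and chosen only to illustrate the
crossover; they are not our measured results.

```python
import math


def pages(rank: int, per_page: int = 40) -> int:
    """Number of result pages viewed before a 1-based rank is reached."""
    return math.ceil(rank / per_page)


def break_even_seconds_per_page(fb_slow, rank_slow, fb_fast, rank_fast):
    """Per-page cost above which the slower, better-ranked method wins.

    Returns None when the better-ranked method saves no pages, in which
    case the faster method is never worse under this model.
    """
    saved_pages = pages(rank_fast) - pages(rank_slow)
    if saved_pages <= 0:
        return None
    return (fb_slow - fb_fast) / saved_pages


# Hypothetical: user-guided feedback takes 60 s and ends at rank 30;
# system-guided feedback takes 40 s and ends at rank 110.
print(break_even_seconds_per_page(60.0, 30, 40.0, 110))  # 10.0
```

Under these made-up values, the user-guided version would be preferable
whenever examining a page takes longer than 10 seconds.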


To examine possible reasons for the performance of the
two versions of the system, in Figure 20(d) we show a
histogram of the confidences that users reported for their
responses. We plot the average certainty that the user
provided over the five iterations, with 3 being most certain and
1 being uncertain. We see that human responses on
WhittleSearch are much more certain than those for its
system-guided counterpart, likely because users often comment on
the most obvious relationships of target and reference
images when they are given a choice. This explains Active
WhittleSearch’s inferior performance in terms of rank, in
Figure 20(b). However, we observe that when all five
MTurkers agree on all of the Active WhittleSearch responses, which
occurred for one query, Active WhittleSearch is better.
Figure 21 shows this example for one of the five users. This is
encouraging because it indicates that if we can pick
feedback requests that are informative and also likely to be
answered with confidence, our active approach can produce
even more accurate search results. Thus, a natural direction
for future work is to incorporate a user-confidence model
into the system.

Fig. 21 A case where Active WhittleSearch is most useful. Observe
the discriminative questions selected by the active system, not only in
terms of attributes like bright-in-color and long-on-the-leg, but also in
terms of the images involved in the comparison along those attribute
dimensions. For example, the user of WhittleSearch chooses to
comment on the relevant long-on-the-leg property, but there are a lot more
images that are less long-on-the-leg than a boot (bottom), compared to
those that are less long-on-the-leg than a pump (top).


5 Conclusion 

We proposed an effective new form of feedback for image
search using relative attributes. In contrast to traditional
binary feedback, our approach allows the user to precisely
indicate how the results compare with his/her mental model.
Building on this idea, we developed a system-guided version
of the method which actively engages the user in a relative
20-questions-like game, where the answers are visual
comparisons. Compared to existing active and passive methods,
our pivot-based formulation is both more efficient (by orders
of magnitude) and more accurate in practice.


In-depth experiments with three diverse datasets show
relative attribute feedback’s clear promise, and suggest
interesting new directions for integrating multiple forms of
feedback for image search. Results demonstrate that our
system-guided approach can rapidly pinpoint the visual target using
a series of well-chosen comparative queries.

In future work, we plan to explore ways to more fully
model uncertainty in the search system. This can include, for
example, representing the user’s confidence when
computing our active selection criteria, or accounting for the
confidence of the attribute models themselves. Furthermore, we
would like to encourage diversity in the questions we ask
the user, incorporate strategies for ensuring that the
questions we ask are not too difficult, and develop an approach
where control can be adaptively transferred between the user
and the system. We will study ways to efficiently learn a new
attribute on the fly, to allow the user to define new attributes
when the current vocabulary is no longer useful. We are also
interested in developing ways to allow for more exploration
during search, and for assignment of different weights to
feedback on different attributes.

Appendix A


Attribute/Class       Athletic  Boots  Clogs  Flats  Heels  Pumps  Rain Boots  Sneakers  Stiletto  Wedding
Pointy at the front   2         6      3      5      10     9      4           1         8         7
Open                  3         2      8      5      7      6      1           4         9         10
Bright in color       6         1      2      8      4      3      10          7         9         5
Covered w/ ornaments  4         9      6      5      8      7      1           3         10        2
Shiny                 2         9      4      3      6      5      8           1         10        7
High at the heel      4         6      5      1      9      8      3           2         10        7
Long on the leg       7         9      2      3      6      5      10          8         4         1
Formal                3         6      4      7      9      8      1           2         5         10
Sporty                10        5      6      7      4      3      8           9         1         2
Feminine              1         6      4      5      10     9      3           2         8         7

Table 4 Ordering of classes for the attributes in the Shoes dataset. A score of 10 denotes that this class has the attribute the most, and 1 denotes the class 
has it the least. 
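
For readers who want to work with Table 4 programmatically, the sketch
below stores three of its rows and derives the class-level comparisons
they imply. The data values are copied from the table; the function
names and the idea of enumerating every (stronger, weaker) class pair
are illustrative assumptions, not a description of our training code.

```python
CLASSES = ["Athletic", "Boots", "Clogs", "Flats", "Heels",
           "Pumps", "Rain Boots", "Sneakers", "Stiletto", "Wedding"]

# Three rows of Table 4 (10 = class has the attribute the most, 1 = the least).
SCORES = {
    "Long on the leg": [7, 9, 2, 3, 6, 5, 10, 8, 4, 1],
    "Formal":          [3, 6, 4, 7, 9, 8, 1, 2, 5, 10],
    "Sporty":          [10, 5, 6, 7, 4, 3, 8, 9, 1, 2],
}


def classes_by_strength(attribute: str):
    """Classes sorted from strongest to weakest presence of the attribute."""
    scores = dict(zip(CLASSES, SCORES[attribute]))
    return sorted(CLASSES, key=lambda c: scores[c], reverse=True)


def ordered_pairs(attribute: str):
    """All (stronger, weaker) class pairs implied by the table row."""
    scores = dict(zip(CLASSES, SCORES[attribute]))
    return [(a, b) for a in CLASSES for b in CLASSES if scores[a] > scores[b]]


print(classes_by_strength("Sporty")[:3])  # ['Athletic', 'Sneakers', 'Rain Boots']
print(len(ordered_pairs("Sporty")))       # 45
```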




